Saturday, March 23, 2013

Building a git network visualization tool


I've always liked the Git network viewer (e.g. see this example). In the "early git days", I found it very useful to have visual feedback for the git operations I was performing.
Even today I find it clearer than the regular vertical log viewer.
Unfortunately it seems that only GitHub provides similar functionality, and of course it works only on remote repositories.

So, I decided to write a similar tool myself :) I picked up Qt 5.0 and libgit2 and built something from scratch in a few hours.
I thought it would be a relatively simple task, but in the end I spent a good amount of time fighting with QML performance. At the moment I've just reached a decent level of performance (at least for a first version), and after I complete the GUI with a basic set of options, I will publish it as OSS.

After the first version, I plan to add navigation functionality the GitHub viewer doesn't have, and also a "graphs" page. This kind of visualization isn't really nice in repositories where many branches are involved, but I think I can find a way to display those nicely too...

So, stay tuned :)



Monday, February 4, 2013

Fun with composition and interfaces

Since code reuse is one of the most important programming concepts, let's see some nice ways to use C++ templates to make code reuse simple and efficient. Here follows a bunch of techniques that have proven useful in my real-life programming.

Suppose you have a set of unrelated classes, and you want to add a common behavior to each of these.

class A : public A_base
{
};

class B : public B_base
{
};
...
 
You want to add a set of common data and functions to each of these classes. These can be extension functions (like loggers, benchmarking helpers, etc...), or algorithm traits (see below).

class printer
{
public:
    void print ()
    {
        std::cout << "Print " << std::endl;
    }
};
The simplest way to accomplish this is using multiple inheritance:
 
class A : public A_base , public printer
{
};

That's easy, but the printer class won't be able to access any of A's members. This way, it is only possible to compose classes which are independent from each other.

Composition by hierarchy insertion

Suppose we want to access a class member to perform an additional operation. A possible way is to insert the extension class into the hierarchy.
template <class T>
class printer : public T
{
public:
    void print ()
    {
        std::cout << "Class name " << this->name() << std::endl;
    }
};

The printer class now relies on a "std::string name()" function being available on its base class. This kind of requirement is quite common in template classes, and until we get concepts we must pay attention that the required methods exist on the classes being extended.
BTW, type traits could also be used in place of direct function calls.
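For instance, here is a minimal C++11 sketch of such a check (my own example; the trait name has_name is invented): it turns the cryptic missing-method error into a readable diagnostic.

#include <iostream>
#include <string>
#include <type_traits>

// Expression-SFINAE detection of a callable name() member.
template <class T>
class has_name
{
    template <class U>
    static auto test (U * u) -> decltype(u->name(), std::true_type());
    static std::false_type test (...);

public:
    static const bool value = decltype(test(static_cast<T*>(nullptr)))::value;
};

template <class T>
class printer : public T
{
    static_assert(has_name<T>::value, "printer<T> requires a T::name() member");

public:
    void print ()
    {
        std::cout << "Class name " << this->name() << std::endl;
    }
};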
The class can be inserted in a class hierarchy and the derived classes can access the additional functionality.
 
class A_base
{
public:
    std::string name () { return "A_base class"; }
};

class A : public printer<A_base>
{
};

int main ()
{
    A a;
    a.print();
    return 0;
}

This technique can be useful to compose algorithms that have different traits, and to share code between similar classes that don't have a common base.
This last example is not a really good one, for multiple reasons:
  • In case of a complex A_base constructor, its arguments should be forwarded by the printer class. C++11 inheriting constructors (using T::T;) make this easy (see the sketch after this list), but in C++98 you'll have to manually forward each A_base constructor, making the class less generic.
  • Inserting a class into a hierarchy can be unwanted.
  • If you want to access A (not A_base) members from the extension class, you need to add another derived class, deriving from printer<A>.
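Here is a quick sketch of the C++11 approach mentioned in the first bullet (my own example): a single using-declaration re-exposes every base constructor through printer.

template <class T>
class printer : public T
{
public:
    using T::T;   // C++11 inheriting constructors: forwards every T constructor

    void print ()
    {
        std::cout << "Class name " << this->name() << std::endl;
    }
};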
Anyway, this specific kind of pattern can still be useful to reuse virtual function implementations:
class my_interface
{
public:
   virtual void functionA () = 0;
   virtual void functionB () = 0;
};

template <class Base>
class some_impl : public Base
{
     void functionA () override;
};

class my_deriv1 : public some_impl<my_interface>
{
   void functionB() override;
};

In particular, when the Base parameter is an interface like my_interface, some_impl can be used to reuse a piece of the implementation.

Using the Curiously Recurring Template Pattern

Now comes the nice part: to overcome the limitations of the previous samples, a nice pattern can be used: the Curiously Recurring Template Pattern (CRTP).

template <class T>
class printer
{
public:
   void print ()
   {
       std::cout << (static_cast<T*>(this))->name() << std::endl;
   }
};

class A : public A_base, public printer<A>
{
public:
   std::string name ()
   {
       return "A class";
   }
};

int main ()
{
     A a;
     a.print ();
     return 0;
}

Let's analyze this a bit: the printer class is still a template, but it doesn't derive from T anymore.
Instead, the printer implementation assumes it will be used in a context where this is convertible to T*, and will have access to all T members.
This is the reason for the static_cast to T* in the code.

If you look at the code of the printer class alone, this question arises immediately:
how come a class unrelated to T can be statically cast to T*?

The answer is that you don't have to look at the printer class "alone": template classes and functions are instantiated on first use.
When the print() function is called, the template is instantiated. At that point the compiler already knows that A derives from printer<A>, so the static cast can be performed like any downcast.
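One caveat worth noting (my own example; the class name Wrong is invented): the cast is only valid when T really derives from printer<T>. Getting the parameter wrong still compiles, but calling print() is undefined behavior.

class Wrong : public printer<A>   // oops: printer<A>, not printer<Wrong>
{
};

// Wrong w;
// w.print();   // compiles, but undefined behavior: "this" is not inside an A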

As you can see, with this idiom you can extend any class and even access its members from the extension functions.
You may have noticed that the extension class can only access public members of the extended class. To lift this restriction, the extension class must be made a friend:
template <class T>
class printer
{
    T* thisClass() { return static_cast<T*>(this); }

public:

    void print ()
    {
        std::cout << thisClass()->name() << std::endl;
    }
};

class A : public A_base, public printer<A>
{
  friend class printer<A>;

private:

   std::string name ()
   {
       return "A class";
   }
};

I've also added a thisClass() utility function to simplify the code and keep the cast in one place (the const version is left to the reader).

Algorithm composition

This specific kind of pattern can be used to create what I call “algorithm traits”, and it’s one of the ways I use it in real code.

Suppose you have a generic algorithm which is composed of two or more parts. Suppose also that the data is shared and possibly stored in another class (as usual, endless combinations are possible). Here I'll make a very simple example, but I've used this to successfully compose complex CAD algorithms:
template <class T>
class base1
{
public:
    // public so that algorithm<T> below can call them
    // (alternatively, keep them protected and befriend algorithm<T>)
    std::vector<int> data;
    void fillData ();
};

template <class T>
class phase1_A
{
public:
    void phase1 ();
};

template <class T>
class phase1_B
{
public:
    void phase1 ();
};

template <class T>
class phase2_A
{
public:
    void phase2 ();
};

template <class T>
class phase2_B
{
public:
    void phase2 ();
};

template <class T>
class algorithm
{
public:
    void run ()
    {
        // CRTP: reach the composed class, where the phases live
        T* self = static_cast<T*>(this);
        self->fillData();
        self->phase1();
        self->phase2();
    }
};

// this would be the version using the "derived class" technique
// class UserA : public algorithm<phase2_A<phase1_A<base1>>> {};

class comb1 : public algorithm<comb1>, public phase1_A<comb1>, public phase2_A<comb1>, public base1<comb1> {};
class comb2 : public algorithm<comb2>, public phase1_B<comb2>, public phase2_A<comb2>, public base1<comb2> {};
class comb3 : public algorithm<comb3>, public phase1_A<comb3>, public phase2_B<comb3>, public base1<comb3> {};
class comb4 : public algorithm<comb4>, public phase1_B<comb4>, public phase2_B<comb4>, public base1<comb4> {};
...
...
This technique is useful when the algorithms heavily manipulate member data, and functional-style programming (input -> copy -> output) would not be efficient. Anyway... it's just another way to combine things.
A small note on performance: the static_casts in the algorithm pieces will usually require a displacement operation on this (i.e. a subtraction), while using a hierarchy usually results in a no-op.
This technique can also be mixed with virtual functions: the algorithm can be implemented in a base class, while the function overrides are composed in the way I just described.
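To make the skeleton above concrete, here is a minimal runnable sketch of one combination (the phase bodies are invented for illustration):

#include <iostream>
#include <vector>

template <class T>
class base1
{
public:
    std::vector<int> data;
    void fillData () { data = std::vector<int>{1, 2, 3}; }
};

template <class T>
class phase1_A
{
public:
    void phase1 ()
    {
        // double every element, reaching the shared data via CRTP
        for (int & v : static_cast<T*>(this)->data) v *= 2;
    }
};

template <class T>
class phase2_A
{
public:
    void phase2 ()
    {
        for (int v : static_cast<T*>(this)->data) std::cout << v << ' ';
        std::cout << std::endl;
    }
};

template <class T>
class algorithm
{
public:
    void run ()
    {
        T* self = static_cast<T*>(this);
        self->fillData();
        self->phase1();
        self->phase2();
    }
};

class comb1 : public algorithm<comb1>, public phase1_A<comb1>,
              public phase2_A<comb1>, public base1<comb1> {};

int main ()
{
    comb1 c;
    c.run();   // prints: 2 4 6
    return 0;
}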

Interfaces

As we saw, extension methods allow reusing specific code in unrelated extended classes. In high-level C++, the same thing is often accomplished with interfaces:
class IPrinter
{
public:
  virtual void print () = 0;
};

class A : public A_base , public IPrinter
{
public:
 void print () override { std::cout << "A class" << std::endl; }
};

In this case each class that implements the interface has to re-implement the code in its own specific way. The (obvious) advantage of using interfaces is that instances can be used in a uniform way, independently of the implementing class.
void use_interface(IPrinter * i)
{
  i->print();
}

A a;
B b; // unrelated to a
use_interface(&a);
use_interface(&b);

This is not possible with the techniques of the previous sections: even if the template class is the same, the instantiated classes are completely unrelated.
Of course one could make use_interface a template function too. Surely that can be a way to go, especially if you are writing code heavily based on templates. In this case, though, I would like to find a high-level way, and reduce template usage (and consequently code bloat and compilation times too).
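For reference, the template alternative would look like this (my own sketch): it gives static binding, at the cost of one instantiation per type.

// Works for any class exposing a print() member.
template <class P>
void use_interface (P & printable)
{
    printable.print();   // resolved at compile time, no virtual call
}

Here is instead the high-level, interface-based version: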
class IPrinter
{
public:
  virtual void print () = 0;
};

template <class T>
class Implementor : public IPrinter
{
    T* thisClass() { return static_cast<T*>(this); }

public:
  void print () override
  {
      std::cout << thisClass()->name () << std::endl;
  }
};

The Implementor class implements the IPrinter interface using the composition technique explained before, and expects the name() function to be present in the user class.
It can be used in this simple way:
class A : public A_base, public Implementor<A>
{
    friend class Implementor<A>;

    std::string name () { return "A"; }
};

int main ()
{
    A a;
    IPrinter * intf = &a;
    intf->print();
    use_interface(intf);
    return 0;
}
Some notes apply:
  • This kind of pattern is useful when you have a common implementation of an interface that depends only in part on the combined class (A in this case).
  • Since A derives from Implementor<A>, it's also true that A implements IPrinter; up-casting from A to IPrinter is allowed.
  • Even if Implementor doesn't have data members, A's size is increased by the presence of the IPrinter vtable pointer.
  • Using interfaces reduces code bloat in the class consumers, since every Implementor-derived class can be used as an (IPrinter *).
    There's a small performance penalty though, caused by virtual function calls and the increased class size.
  • The benefit is that virtual dispatch is used only when calling the print function through an IPrinter pointer. If called directly, static binding is used instead. This can be true even for references, if the C++11 final modifier is added to the Implementor definition:
 void print () override final;
A a;
A & ref = a;
IPrinter * intf = &a;
a.print ();        // static binding
ref.print();       // dynamic binding (static if declared with final)
intf->print (); // dynamic binding;

This kind of composition doesn't have a limit on the number of interfaces.
class mega_composited : public Base, public Implementor<mega_composited>, public OtherImplementor<mega_composited>, ...
{
};


Adapters

These implementors can be seen as a sort of adapter between your class and the implemented interfaces. This means that the adapter can also be an external class. In that case you will need to pass a pointer to the original class in the constructor.
template <class T>
class Implementor : public IPrinter
{
    T* thisClass;

public:
  Implementor (T * original) : thisClass(original)
  {
  }

  void print () override
  {
      std::cout << thisClass->name () << std::endl;
  }
};
Note that thisClass is now a data member, initialized in the constructor.
 ...
 A a;
 Implementor<A> impl(&a);
 use_interface(&impl);
As you see, the implementor is used as an adapter between A and IPrinter. This way, class A won't contain the additional member functions.
Note: memory management has been totally omitted from this article. Be careful with these pointers in real code!
One can also make the object convertible to the interface while keeping the implementor as an object member (a sort of COM aggregation).
class A
{
public:

   Implementor<A> printer_impl;

   /*explicit*/ operator IPrinter * () { return &printer_impl; }

   A () : printer_impl(this) {}
};
or even lazier...
#include <memory>

class A
{
public:
     std::unique_ptr<Implementor<A>> printer_impl;

     /*explicit*/ operator IPrinter * () {
       if (printer_impl.get() == nullptr)
          printer_impl.reset(new Implementor<A>(this));
       return printer_impl.get();
     }
};


I will stop here for now. C++ lets programmers compose things in many interesting ways, and obtain a high degree of code reuse without losing performance.
This kind of composition is close to the original intent of templates, i.e. a generic piece of code that can be reused without resorting to copy-and-paste! Nice :)

Friday, February 1, 2013

Write a C++ blog they said...

... it will be fun they said! :)
Indeed I have already written two more articles, but it's the editing part that is time consuming:
  • Read the article over and over and make sure the English is good enough.
  • Do proper source code formatting.
  • Check that the code actually compiles and works.
  • Make sure that the whole article makes sense, and so do its smaller parts.
All of this can double the time required to write the original article text.
In the meantime I'll try to do smaller updates on smaller topics.

So, next time we'll have some fun with interfaces and class composition! Stay tuned!

Sunday, January 6, 2013

Unicode and your application (5 of 5)


Other parts: Part 1, Part 2, Part 3, Part 4

Here comes the last installment of this series: a quick discussion on output files, then a summary of the possible choices using the C++ language.

Output files

What we saw for input files applies to output files as well.
Whenever your application needs to save a text output file for the user, the rules of thumb are the following:

  • Allow the user to select an output encoding for the files they are going to save.
  • Consider a default encoding in the program's configuration options, and allow the user to change it during the output phase.
  • If your program loads and saves the same text file, save the original encoding and propose it as the default.
  • Warn the user if the conversion is lossy (e.g. when converting from UTF8 to ANSI).

Of course this doesn't apply to files stored in an internal application format. In that case it's up to you to choose the preferred encoding.

The steps of text-file saving mirror the steps of text-file loading:
  1. Handle the user interaction and encoding selection (with the appropriate warnings)
  2. Convert the buffer from your internal encoding to the output encoding
  3. Send/save the byte buffer to the output device
I won't bother with the pseudo-code for these specific steps; instead, I have updated the unicode samples on GitHub with "uniconv", a utility that is an evolution of the unicat one: a tool that lets you convert a text file to a different encoding.
  uniconv input_file output_file [--in_enc=xxxx] --out_enc=xxxx [--detect-bom] [--no-write-bom]
It basically lets you choose a different encoding for both input and output files.

To BOM or not to BOM?


When saving a UTF text file, it's really appropriate to save the BOM in the first bytes of the file. This allows other programs to automatically detect the encoding of the file (see the previous part).
The Unicode standard discourages the usage of the BOM in UTF8 text files. However, at least on Windows, an UTF8 BOM is the only way to automatically distinguish an UTF8 text file from an "ANSI" one.
So, the choice is up to you, depending on how and where the generated files will be used.
Personally, I tend to prefer the presence of the BOM, to leave all the ambiguities behind.
Edit: since the statements above seem to be a personal opinion, and I don't have enough arguments either for or against storing an UTF8 BOM, I'll let the reader find a good answer on their own! I promise I'll come back to this topic.
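Whatever you decide, the mechanics of writing one are trivial; here is a minimal sketch (my own example; the function name is invented) that prepends the three UTF8 BOM bytes:

#include <fstream>
#include <string>

// Writes "text" (already UTF8-encoded) to "path", prefixed by the BOM.
void write_utf8_with_bom (const std::string & path, const std::string & text)
{
    std::ofstream out(path.c_str(), std::ios::binary);
    const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };   // the UTF8 BOM
    out.write(reinterpret_cast<const char*>(bom), 3);
    out.write(text.data(), text.size());
}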

You have to make a choice

As we have seen in the previous article, there are a lot of differences between systems, encodings and compilers. Still, you need to handle strings and manipulate them. So, what's the most appropriate choice for character and string types?
Posing such a question in a forum or on StackOverflow could easily generate a flame war :) This is one of those decisions that depends on a wide number of factors, and there's no definitive choice valid for everyone.

Choosing a string and encoding type doesn't mean you have to use a Unicode encoding at all; it also doesn't mean that you have to use a single encoding for all the platforms you are porting your application to.

The important thing is that this single choice is propagated coherently across your program.

Basically you have to decide about three fundamental aspects:
  1. Whether to use standard C++ types and functions or an existing framework
  2. The character and string type you will be using
  3. The internal encoding for your strings
Choosing an existing framework will often force choices 2) and 3).
For instance, by choosing Qt you will automatically be forced to use QChar and UTF-16.
Choices 2) and 3) are strictly related, and can be decided in either order. Indeed, choosing a data type will force the encoding you use, depending on the word size. Alternatively, one can choose a specific encoding, and the data type will be chosen as a consequence.

Depending on how portable your software needs to be, you can choose between:
  • a narrow character type (char) and the related std::string type
  • a wide character type (wchar_t) and the related std::wstring type
  • a new C++11 character type
  • a character type depending on the compilation system
Here follows a quick comparison of these choices.

Choosing a narrow character type

This means choosing the char/std::string pair.
Pros:
  • It's the best supported at the library level, and widely used in existing programs and libraries.
  • The data type can be adapted to a variety of encodings, both fixed and variable length (e.g. Latin1 or UTF8).
Cons:
  • On Windows you can only use fixed-length encodings; UTF8 is not supported unless you do explicit conversions.
  • On Linux the internal encoding can vary between systems, and UTF8 is just one of the choices. Indeed you can have a fixed or variable-length encoding depending on the locale.

Let me stress once again that choosing char on the Windows platform is a BAD choice, since your program will not support Unicode at all.

Choosing a wide character type

This means using wchar_t and std::wstring.

Pros:
  • Wide character types are actually more Unicode-friendly and less ambiguous than the char data type.
  • Windows works better with (indeed, is built on!) wide-character strings.
Cons:
  • The wchar_t type has different sizes between systems and compilers, and the encoding changes accordingly.
  • Library support is more limited; some functions are missing from some standard library implementations.
  • Wide characters are not really widespread, and existing libraries that chose the "char" data type can require character type conversions and cause trouble.
  • Unixes "work better" with the "char" data type.

Choosing a character type depending on the compilation system

This means that the character type is #defined at compile time, and varies between systems.
Pros:
  • The character type will adapt to the "preferred" one of each system. For instance, it can be wchar_t on Windows and char on Unixes.
  • You are sure that the character type is well supported by the library functions too.
Cons:
  • You have to think in a generic way and make sure that all the functions you use are available for both data types.
  • Not many libraries support a generic data type, and the usage is not widespread on Unixes (more so on Windows).
  • You will have to map, with a define, all the functions you are using.
Have you ever met the infamous TCHAR type on Windows? It is defined as:
#ifndef UNICODE
  #define TCHAR char
#else
  #define TCHAR wchar_t
#endif
The Visual C++ library also defines a set of generic "C" library functions that map to the narrow or wide version.
This technique can be used across different systems too, and indeed works well. The bad part is that you will have to mark all your literals with a macro that maps to the char or wchar_t version.
#if !defined(WIN32) || !defined(UNICODE)
  #define tchar char
  #define tstring std::string
  #define _T(x) x
#else
  #define tchar wchar_t
  #define tstring std::wstring
  #define _T(x) L##x
#endif

tstring t = _T("Hello world");

(On Windows the _T macro normally comes from <tchar.h>; here we define a portable equivalent.)

I have never seen this approach used outside Windows, but I have used it sometimes and it's a viable choice, especially if you don't do too much string manipulation.

Choosing the new C++11 types

Hey, nice idea. This means using char16_t and char32_t (and the related u16string and u32string).
Pros:
  • You get data types that have fixed sizes across systems and compilers.
  • You only write and test your code once.
  • Library support will likely improve in this direction.
Cons:
  • As of today, library support is lacking.
  • Operating systems don't support these data types, so you will need to do conversions anyway to make function calls.
  • UTF8 strings still use the char data type and std::string, increasing ambiguity (see the previous chapters).
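For reference, a quick sketch of what these types look like in practice (assuming a C++11 compiler with Unicode literal support):

#include <string>

int main ()
{
    std::u16string s16 = u"Hello";    // char16_t, UTF-16 code units
    std::u32string s32 = U"Hello";    // char32_t, UTF-32 code points
    std::string    s8  = u8"Hello";   // still plain char, but UTF8-encoded
    return 0;
}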


Making a choice, the opposite way

As we have seen above, making a choice about the data type will lead you to different encodings in different runtimes and systems.
The opposite way to take the decision is choosing an encoding and then inferring the data type from it. For instance, one can choose UTF8 and select the data type consequently.
This kind of choice goes "against" common C/C++ design and usage because, as we have seen, encodings and data types tend to vary between platforms.
Still, I can say that it is probably a really good choice, and the one that many "existing frameworks" took.
Pros:
  • You write and test code only once (!)
Cons:
  • Forget about the C/C++ libraries, unless you are willing to do many conversions depending on the system (thus losing all the advantages).
  • This kind of approach will likely require custom data types.
  • Libraries written using standard C++ will require many conversions between string types.
In this regard, let's consider three possible choices:
UTF8:
  • You can use the char data type and stay compatible; I suggest not doing so, to avoid ambiguity.
  • On Windows it is not supported, so you will need to convert to UTF16/wchar_t around your API calls.
  • On Unixes that support UTF8 locales, it works out of the box.
UTF16:
  • If you have a C++11 compiler you can use a predefined data type; otherwise you will have to invent one of your own that is 16 bits on all systems.
  • On Windows, all the APIs work out of the box; on Linux you will have to convert to the current locale (hopefully UTF8).
UTF32:
  • Again, if you have C++11 you can use a predefined data type; otherwise you will have to invent one of your own that is 32 bits on all systems.
  • On any system you will have to do conversions to make system calls.
This approach, taken by frameworks such as Qt, .NET, etc., requires a good amount of code to be written. Not only do they choose an encoding for the strings, they also contain a good number of supporting functions, streams, conversions, etc.

Finally choosing

All of this seems like a cul-de-sac :)
To sum up, one has either to choose a C++ way of doing things that varies a lot between systems, or a "fixed encoding" way that pushes you towards existing frameworks.

Fortunately, C++ genericity can abstract the differences, and hopefully standard libraries will improve Unicode support out of the box. I don't expect too much change from existing OS APIs though.
Still, I'm not able to give you a rule and the "correct" solution, simply because one doesn't exist.
Anyway, I hope to have pointed out the possible choices and scenarios that a programmer can face during the development of an application.

Now that you have read all of this, I hope that you are asking yourself these questions:
  • Do my programs correctly handle compiler differences in encodings and function behavior?
  • Do my programs correctly handle variable-length encodings?
  • Do my programs do all the required conversions when passing strings around?
  • Do my programs correctly handle input and output encodings for text files?

Personal choices

Personally, if the application I'm writing does any text manipulation, I prefer to use an existing framework, such as Qt, and forget about the problem.
I really prefer the maintainability and the reduced test burden of this approach.
Anyway, if the strings are just copied and passed around, I usually stick with the standard C/C++ strings, like std::string or "tstring" when necessary. This way I keep the program portable with a reduced set of dependencies.
Finally, when I write Windows-only programs I choose std::wstring, and then use the Windows APIs to do everything.

C++ standard library status : again

As we have previously seen, standard C++ classes are not really Unicode-friendly, and the implementation level varies very much (too much!) between systems. Anyway, if you are using a decent C++11 compiler you will find some utility classes that let you do many of these operations using only standard classes.
I will write an addendum on this soon, and I will update the two unicode examples using C++11 instead of Qt, trying to write them in the most portable way possible.
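As a teaser, here is a minimal sketch of the kind of conversion this enables (assuming your standard library already ships <codecvt>, which at the time of writing not all of them do):

#include <codecvt>
#include <locale>
#include <string>

int main ()
{
    // UTF8 <-> UTF16 conversion using only standard classes
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;

    std::u16string utf16 = conv.from_bytes(u8"caff\u00E8");  // UTF8 -> UTF16
    std::string    utf8  = conv.to_bytes(utf16);             // UTF16 -> UTF8
    return 0;
}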

Conclusions

I hope to get some feedback about all of this: I'm still learning, and I think that shared knowledge always leads to better decisions. So, feel free to drop me an email or a comment below!