Thursday, February 16, 2023

C++/C# interop for native applications using WinRT

The repository with the code for this article is here: https://github.com/QbProg/CppCSInterop
--

When you search for C/C++ C# interop on Google you find a lot of articles, but many of them (if not all) are focused on P/Invoke interop, which in the end is a "C"-style interop, not a C++ one.

There are a number of tools (like CppSharp, SWIG, and others that can be found online) which are able to generate bindings from C++ to C#, even in a cross-platform way.

On Windows, another common option is C++/CLI, which works well in both directions. Unfortunately it's becoming abandoned, as it doesn't support C++20 and probably never will.

Still on Windows, there's another possibility for seamless interop: COM interfaces. This works without much trouble, especially if one uses only the COM-ABI aspect (which I recently found being referred to as nano-COM).

In Windows 10, finally, there's yet another option which works for native applications too and, thanks to higher-level libraries, is relatively easy to integrate and implement, without getting lost in all the COM details which previously made this cumbersome. The technology is WinRT, the same one which originally powered UWP, and it can now be used from Win32/desktop code too.

In this article I will show you my journey toward an interop system between .NET (Core) and native C++ desktop code using WinRT.

Since WinRT and its various projections are the de-facto technology used for all new Windows API developments (e.g. WinUI 3), the technologies used here are well covered in the Microsoft documentation.
However, all the examples and tutorials are much more focused on the user/consumer side than on the authoring side; gluing everything together was indeed a little complex and required much trial-and-error, so I'm hoping this article can be of help to someone.

Any feedback and improvement is welcome!

Motivation and requirements

In my projects, I'm looking specifically for a way to replace C++/CLI.
My applications are native Win32 C++ applications (some even using Qt), and I have a number of components implemented in C# that I need to talk to; on the other side, I'd like to expose some C++ classes to the C# components too, so bidirectional interaction is needed.

These are the configuration and requirements of the project we will be using:

  • Native Windows desktop application target
    A complex unpackaged native C++ Win32 application composed of many executables and DLLs;
  • Run on all Windows 10 versions (at least starting from 1607)
    As of today the oldest Windows LTSB version will be supported until 2026, and unfortunately I have customers using it, so this was a hard requirement.
  • CMake as build system for C++ (using Visual Studio generators)
  • Bidirectional communication
    This means that I want to call C++ code from C# and C# code from C++.
  • Bidirectional activation (i.e. create C++ objects on the C# side and vice-versa)
    In addition to passing objects, I want to be able to instantiate objects on both sides.
  • Avoid too much code bloat
    Adding interfaces and functions should be simple, possibly done only once, and without too much complexity.
  • Internal interop only
    Generated interop code will be used internally by the application only. We'll skip everything about versioning, compatibility, etc.
  • App-local deployment
    I want to deploy all(*) the dependencies in the application directory.
  • Registration-free and manifest-free activation
    Since the interop is done for internal use, I absolutely want to avoid registering the components; in addition, I don't want to even manually create a manifest declaring all the classes exposed in C# or C++.

The journey to fulfill all these requirements was kind of tricky, but in the end I succeeded, and I'll show you all the steps here.
The repository with the code and a working example can be found here on GitHub.

About WinRT

WinRT technology was introduced in Windows 8 to support UWP, and it was initially only available to UWP application developers.
Later on, in Windows 10 version 1809, Microsoft made the technology available to desktop applications too (Desktop Bridge).

In this article I take some WinRT introductory knowledge for granted, and I will focus on the interop mechanism. I'm not really a WinRT expert myself, either.

From an interop perspective, WinRT and its projections provide a simple way to declare and pass primitives and structured data, and all the gluing is done by the system.

Note that WinRT types are really required for interop purposes only (e.g. to pass strings and arrays), and the usage of the Windows UWP APIs can be almost totally avoided. In my tests I just used native C++ functions together with some Windows.Forms functions.

On the other hand, nothing prevents you from passing around UWP or WinUI objects and data and doing interop in a fully modern WinUI application.

About .NET 

These examples work with .NET 6. It seems to me that C#/WinRT doesn't support .NET Framework 4.8; I may be wrong, but the tooling is so well integrated with .NET 6 that I didn't bother to check older technologies.

Tools used

A number of tools are used in this article.

Note that Visual Studio 2022 does not need UWP support installed. Just make sure you have the correct Windows SDK versions, or the C# compilation will fail.

cppwinrt and CsWinRT will be added using NuGet packages. Note that the nuget command-line tool is not distributed with Visual Studio or the .NET SDK, so it has to be downloaded manually.

Setting up the repository

You'll find the repository here: https://github.com/QbProg/CppCSInterop

Just clone it and you'll find all the referenced material. Run setup.bat: the CMake scripts automatically download a bunch of dependencies and source code from around the web.
Please make sure to have the proper dependencies installed, especially the correct version of the Windows SDK. If not, just add it using the Visual Studio Installer.

The starting point

Microsoft has two tutorials covering the creation of C# components and C++ components and their consumption from the other side. These work for native desktop applications too.

https://learn.microsoft.com/it-it/windows/apps/develop/platform/csharp-winrt/net-projection-from-cppwinrt-component

https://learn.microsoft.com/it-it/windows/apps/develop/platform/csharp-winrt/create-windows-runtime-component-cswinrt

The code contained in these tutorials is good, but there are a few drawbacks:

  • They use NuGet to expose the package to the other side. This is a great choice for library authors, but a little too much for internal interop purposes.
  • They use Visual Studio and not CMake for the C++ projects.
  • They use reg-free but not manifest-free activation.

To use CMake to develop cppwinrt projects I found this great starting point:

https://github.com/fredemmott/cmake-cpp-winrt-winui3

Thanks to the author for it. This repository is focused on WinUI 3, but it gives a starting point for building a cppwinrt CMake project and, most importantly, it contains a workaround for a blocking CMake+cppwinrt compilation bug.

Creating a C++ WinRT component

So, after this long introduction, let's start by creating a C++ WinRT component. In the repository, the reference is the CppComponent project.

cppwinrt is the official native WinRT projection; it is distributed as NuGet packages and basically provides the cppwinrt compiler, which is capable of generating the boilerplate code that lets C++ users implement the exposed classes.

The process of adding a C++ class (and any of its functions) is as simple as:

- creating a MIDL file with the class and its functions (Class.idl)
- adding it to CMake and building
- implementing the functions in the Class.cpp file

The generated library can automatically be used in C#, and all of the declared types will be available there.

In our example we created this MIDL file:

namespace CppComponent
{
    [default_interface]
    runtimeclass Class
    {
        Class();

        String Hello();
    }
}


and the implementation looks like this:

#include "pch.h"
#include "Class.h"
#include "Class.g.cpp"

namespace winrt::CppComponent::implementation
{
    hstring Class::Hello()
    {
        return L"Hello World";
    }
}


Nothing really fancy. Just note the usage of cppwinrt projected types (hstring), which can also be constructed from standard types using helper functions. 
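
As a side note, here is a minimal sketch of those conversion helpers; the function name and values are mine, just for illustration:

#include <winrt/base.h>
#include <string>

void string_conversions()
{
    winrt::hstring h{ L"Hello" };               // constructed from a wide literal
    std::wstring wide{ h };                      // hstring converts to std::wstring
    winrt::hstring h2 = winrt::to_hstring(42);   // built from numeric values
    std::string utf8 = winrt::to_string(h);      // to a UTF-8 encoded std::string
}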

After building, we get a CppComponent.dll which only depends on the Visual C++ runtime and can be used in a standalone way.

The CppComponent.dll contains the implementation of the C++ part, plus some generated cppwinrt code (the activation factory, etc.).

In addition, there is a CppComponent.winmd metadata file, which will be used to generate projections for other languages.

In the IDL definition of the C++ part, it's also possible to declare interfaces only, which can later be implemented in C# without having to expose a WinRT component on the C# side too. This may be useful if the scope of the interop is just exposing the C++ objects to C#.

Notes on the CMakeLists.txt file 

The CMake file for the CppComponent has a few additions compared to a standard C++ project.

First of all, in the main CMakeLists.txt file, nuget gets downloaded with the CMake download functions, and the tool is stored in the "tools" directory.

Then nuget is used to restore the cppwinrt NuGet packages.

In the project CMakeLists we then define a regular C++ target.

Since cppwinrt auto-generates the module definitions, we add the auto-generated file as a CMake dependency so CMake correctly recognizes it. In addition, there are some global Visual Studio generator defines that basically set up the Visual Studio project to work correctly with the WinRT options.

add_custom_command(OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/CppComponent.dir/$<CONFIG>/Generated Files/module.g.cpp" COMMAND echo )
set(ADDITIONAL_FILES packages.config "${CMAKE_CURRENT_BINARY_DIR}/CppComponent.dir/$<CONFIG>/Generated Files/module.g.cpp")

Finally, we copy the generated binaries to a common "distrib" directory; this step is explained below.

DLL Naming conventions

Since we want to use manifest-free activation, it's important that the name of the DLL matches the name of the root namespace of our components. So in this case the namespace is CppComponent and the DLL has the same name. Nested namespaces and classes can follow this rule too.

Generating a C# projection for the Cpp component

To use any WinRT component from C#, a CsWinRT projection assembly needs to be created. Once created, this assembly can be referenced from other C# projects like a regular assembly.

Unfortunately there seems to be no alternative to creating a separate C# projection assembly for each C++ WinRT component used, so we'll stick to this rule for the moment.

In the main CMake file I added the C# projection project as an external project. The C# project is an SDK-style project based on .NET 6.

INCLUDE_EXTERNAL_MSPROJECT(CppComponentProjection "${CMAKE_CURRENT_SOURCE_DIR}/CppComponentProjection/CppComponentProjection.csproj" PLATFORM "x64")

This is a regular C#/WinRT project, created with Visual Studio. A few considerations to look at:

  • The target framework has an explicit reference to Windows (see the tutorial)
    <TargetFramework>net6.0-windows10.0.22000.0</TargetFramework>
    <Platforms>x64</Platforms>
    Make sure you have the exact Windows SDK version installed, or the compiler will complain

  • Added the C#/WinRT reference to the project
    <PackageReference Include="Microsoft.Windows.CsWinRT" Version="2.0.1" /> 
  • We referenced the C++ project using a relative path. Be careful with this if you change the directory structure.
    <ProjectReference Include="..\build\CppComponent\CppComponent.vcxproj" />
  • We referenced the CppComponent in the CsWinRT property group.
    <CsWinRTIncludes>CppComponent</CsWinRTIncludes>

In the end we get our C# projection of the component, which we can use in any other C# project like a regular assembly.

Note that, despite what the Microsoft tutorial suggests, I didn't create a NuGet package for the projection. If you are a component author it is surely the way to go, but I just want to use these classes as an interop mechanism for my application.

Using the C# projection

In the CMake example I added a test C# executable, which uses the projection to instantiate the component.
Note that there were two issues in using the projection from the C# test executable:

  • The C# project needed to reference CsWinRT too; this was a bit unexpected but was required.
  • The projection's "ProjectReference" needed a special attribute: SkipGetTargetFrameworkProperties="true"

I don't really know what's going on here; if anyone has an idea please let me know.

The other way around: creating a C# WinRT component and using it in a native C++ project

The process described in the previous section is particularly useful if you have a C# application and want to integrate C++ code or libraries.
For native C++ applications the scenario may be the exact opposite: since C# has a number of great libraries and components, and UI programming is indeed easier and more intuitive to do in a managed environment, one could start integrating C# classes into an existing C++ native application (this was indeed my case).

So, let's create a C#/WinRT component for consumption.

In this case the Microsoft tutorial is a good starting point too. The creation of the C# component is easy, and I added it to CMake using the same external-project method.

In our source the component is CSComponent.

In addition to just adding a C# class, I started using the classes from the previous C++ component projection, to show that mixing components in different languages is possible.

In a C#/WinRT component every public class is exported in the metadata (winmd) and will be usable from the C++ side. The projection system will automatically handle the marshaling and type matching for you.
There may be attributes to control the details of marshaling and export; please refer to the CsWinRT docs in this regard.

namespace CSComponent
{
    public sealed class CSClass
    {       
        public string Hello()
        {
            CppComponent.Class c1 = new CppComponent.Class();
            return c1.Hello();
        }
    }
}

We still used the convention that the root namespace should match the assembly name. This is really important in C# too, as we'll see later.

The C# component build produces the CSComponent.dll assembly, and it generates a number of other DLLs too, namely WinRT.Host.dll and WinRT.Host.Shim.dll.

These additional DLLs are crucial for manifest-free activation, as explained below.

Using the C# component from C++

Native C++ projects can use WinRT directly, without having to resort to any special UWP attribute or setup.

The main thing here is to point cppwinrt to the winmd file generated by the C# component; cppwinrt will generate the C++ interfaces and activation code for you.

From the user's point of view, it's just a matter of allocating and using the class in the typical cppwinrt way:

CSComponent::CSClass ex;
std::wcout << ex.Hello().c_str() << std::endl;

In my case I wanted to use the C# component not from the main EXE application, but from a DLL. This scenario is slightly more complex, but it exposes a number of critical issues in using the C# component from native code.

In our scenario we created a DLL (CSComponentClient.dll) with a single export (dll_main), which we call from the main EXE. Consuming the component from this DLL is not really different from consuming it directly in the main EXE.

If you follow the official tutorials, you see that you need to add the C# component entries to the main EXE manifest. In our case we still want to follow the manifest-free activation route, and this is where we found some issues: if you run the code without modifications, you'll see that it does not work.

Before solving the issue, let me write a quick recap of how COM/WinRT activation works on the C++ side.

Activation (instantiation) of C# WinRT classes

The activation system in WinRT is kind of complex, and it evolved from the venerable COM system.

When you want to create a COM/WinRT class, you generally need what is called an activation factory, i.e. a special factory class that instantiates objects of the concrete class.
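
To make the mechanism concrete, here is a rough sketch of what activation looks like at the ABI level; cppwinrt and CsWinRT generate and hide all of this for you, and the class name in the usage comment is just this article's CSComponent example:

#include <windows.h>
#include <roapi.h>
#include <winstring.h>
#include <activation.h>
#include <cwchar>
#pragma comment(lib, "runtimeobject.lib")

// e.g. activate_by_name(L"CSComponent.CSClass"); the caller must Release() the result
IInspectable* activate_by_name(const wchar_t* className)
{
    // assumes RoInitialize/winrt::init_apartment was already called on this thread
    HSTRING classId = nullptr;
    ::WindowsCreateString(className, static_cast<UINT32>(::wcslen(className)), &classId);

    IActivationFactory* factory = nullptr;
    IInspectable* instance = nullptr;
    if (SUCCEEDED(::RoGetActivationFactory(classId, IID_PPV_ARGS(&factory))))
    {
        factory->ActivateInstance(&instance);
        factory->Release();
    }
    ::WindowsDeleteString(classId);
    return instance; // nullptr if the class could not be activated
}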

In COM, each class being instantiated needed to be registered in the Windows registry (remember the infamous regsvr32 tool?); this means that to use a COM component, the installer/deployer would have to register the component itself in the registry. The registration entry lets the COM system know in which DLL the activation factory of a class is implemented, and where to find that DLL when an object of the class is requested.

Later on (in the Windows XP era, IIRC) reg-free COM activation was introduced; this allowed component users to distribute components locally with the application without having to register them, and most importantly to avoid clashes with different component versions installed globally at system level.
The main idea was to declare where to find all the activation factories inside the manifest file of the main EXE application, instead of registering them in the registry.

Unpackaged WinRT apps usually follow the same approach: you can use WinRT components in your application as long as you declare them in the main EXE's manifest.

If you look at the tutorial steps, you'll see a manifest like this in the native EXE console application:

<?xml version="1.0" encoding="utf-8"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
    <assemblyIdentity version="1.0.0.0" name="CppConsoleApp"/>
    <file name="WinRT.Host.dll">
        <activatableClass
            name="AuthoringDemo.Example"
            threadingModel="both"
            xmlns="urn:schemas-microsoft-com:winrt.v1" />
    </file>
</assembly>

You see a few things here:

  • The manifest is declared on the EXE file
  • The referenced DLL is WinRT.Host.dll and, surprisingly, not CSComponent.dll
  • Each class exported by CSComponent needs to be listed here

It turns out that each C#/WinRT component generates two additional DLLs containing the activation factory code; these DLLs are named WinRT.Host.dll and WinRT.Host.Shim.dll.
Some more details can be found here.

These host DLLs contain native code to:

  1. Activate the .NET Core engine if required
  2. Load the CSComponent assembly
  3. Export the native DLL entry points so the WinRT system can use them

This approach works, but given our requirements it has a number of issues:

  • For each exposed class, an update to the client (EXE) manifest is needed.
  • In case of multiple components, the WinRT.Host.dll names clash, as each component ships its own version.
  • Which manifest is to be updated if it is not the EXE calling the C# component, but a different DLL?

As already explained above, in our case the solution is to avoid manifests entirely, using manifest-free activation.

Manifest free activation for our C# component

Like its C# counterpart, cppwinrt supports manifest-free activation too. The issue is that the rules used by cppwinrt don't match the (default) C# DLL naming convention: in particular, the cppwinrt activator won't look for the WinRT.Host.dll name, so that DLL has to be renamed. If you looked at the previous link's information, there are three potential naming conventions to use. We chose to rename the host DLL and keep the main assembly name intact, because in the end it's the more intuitive approach.

By default cppwinrt will look for a DLL with the same name as the root namespace. The same logic applies to nested namespaces. So in our case the loader will look for the DllGetActivationFactory export inside CSComponent.dll.

Unfortunately CSComponent.dll is a C# assembly and does not export any such function. The function is instead contained in the auto-generated WinRT.Host.dll.

The WinRT manifest-free activation logic cannot be customized, and there's no way to tell the cppwinrt system to load WinRT.Host.dll instead. This is a hard blocker for our scenario (which I reported here), but it seems it was simply a non-intended use-case.

Fortunately, there's a detour hook in the winrt default activation function, probably implemented for testing purposes, that can be leveraged to implement the correct logic. It's just a matter of setting a function pointer in winrt, and the custom activation logic will be used in addition to the standard one.

To do this, we set the detour once per process (in the EXE's main, in the DLL's main, or in any other utility function), and there we implement our custom DLL matching logic.

// the handler must match this signature defined by cppwinrt
int32_t __stdcall custom_winrt_activation_handler(void* classId, winrt::guid const& guid, void** result) noexcept;

// set it once per process:
winrt_activation_handler = custom_winrt_activation_handler;

Note: the signature of this function could be cppwinrt version-specific.

The detour code is copied straight from the cppwinrt source; the only change in logic is that it tries one more DLL name, with the .Host.dll suffix.
See WinRTCustomActivation.cpp for the sample code.

So in our case it will look for <root_namespace>.Host.dll, and it will only do so after the standard DLL matching has failed:

CSComponent.dll -> fail
CSComponent.Host.dll -> success
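
For reference, here is a simplified sketch of what the custom handler does; the real code in WinRTCustomActivation.cpp is adapted from the cppwinrt sources, so the details here are illustrative only:

#include <windows.h>
#include <winstring.h>
#include <activation.h>
#include <winrt/base.h>
#include <string>
#include <string_view>

int32_t __stdcall custom_winrt_activation_handler(void* classId, winrt::guid const& guid, void** result) noexcept
{
    // classId is an HSTRING at the ABI level; view its characters without taking ownership
    std::wstring_view name{ ::WindowsGetStringRawBuffer(static_cast<HSTRING>(classId), nullptr) };

    // "CSComponent.CSClass" -> root namespace "CSComponent" -> "CSComponent.Host.dll"
    std::wstring dll{ name.substr(0, name.find(L'.')) };
    dll += L".Host.dll";

    HMODULE module = ::LoadLibraryW(dll.c_str());
    if (!module)
        return HRESULT_FROM_WIN32(::GetLastError());

    // every activatable component DLL exports DllGetActivationFactory
    using factory_getter = HRESULT(__stdcall*)(HSTRING, IActivationFactory**);
    auto getFactory = reinterpret_cast<factory_getter>(::GetProcAddress(module, "DllGetActivationFactory"));
    if (!getFactory)
        return HRESULT_FROM_WIN32(::GetLastError());

    IActivationFactory* factory = nullptr;
    HRESULT hr = getFactory(static_cast<HSTRING>(classId), &factory);
    if (FAILED(hr))
        return hr;

    // hand back the interface the caller actually asked for
    hr = factory->QueryInterface(reinterpret_cast<GUID const&>(guid), result);
    factory->Release();
    return hr;
}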

On the C# side of the project, we implement a custom post-build event which copies the local WinRT.Host.dll to a more specific name. We end up with these names in the output directory:

CSComponent.Host.dll
CSComponent.Host.runtimeconfig.json

So many DLLs, so many files

At this point the example should run, but it does so only because there are post-build events that put the DLLs in a specific "distrib" folder.

Without these steps the application would fail, since the manifest-free loaders won't be able to find the DLLs in the application path. This is of course similar to how regular EXEs and DLLs work.

So if you look in the cmake and csproj files you'll see that there is some custom post-build logic to:

- Copy the CppComponent DLLs to the application directory
- Copy the CppComponent projection to the application directory
- Rename the CSComponent WinRT.Host.dll to CSComponent.Host.dll
- Copy the CSComponent files to the application directory

After the copy, the EXE file needs to be run from the application directory and not from the build directory.
This is done in CMake by setting the proper Visual Studio debug options at configure time.

Also note that the C# components require the .runtimeconfig files to properly run; don't forget to deploy these too!

App-local deployment

Up to this point the application will run correctly on a developer workstation. To run it on a standalone ("clean") machine, two things are required:

  • The Visual C++ redistributables
    These can be copied directly into the application directory structure to avoid the system installation. A configure/install step can be used for this. In our example we copy them from the Visual Studio directory.

    Note: back in the UWP days there was a need for VCRT forwarders, but I found that's not the case anymore.

  • The .NET 6.0 redistributable installed
    I could not find a way to avoid the system-wide installation. Surely a .NET self-contained publish could work, but I didn't find a straightforward way to do this for a mixed C++/C# project. Any contribution is welcome in this regard :)

Supporting versions prior to Windows 10 1809

All these examples run smoothly starting from Windows 10 version 1809, which introduced Desktop Bridge and UWP API usage for unpackaged applications.

Fortunately enough, Microsoft made the xlang project repository available; it contains winrtact.dll which, when loaded inside a process, will detour the WinRT activation APIs and make the whole system work on Windows versions down to 8.1.
I only tested this down to Windows 10 version 1607, which was my minimum system requirement. Finding a virtual machine for it was already hard enough :)

To get that library, one has to download the repository and build it by hand using the Visual Studio solution provided.
Then, to enable the detouring, one has to link against winrtact.dll and force a symbol on it (or just use LoadLibrary). The library is smart enough to auto-detect when it's needed, so this can be done on all Windows versions.
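
The LoadLibrary variant is the simplest one; a minimal sketch, assuming winrtact.dll is deployed next to the executable:

#include <windows.h>

void enable_downlevel_winrt_activation()
{
    // the library self-detects whether the OS actually needs the detours,
    // so calling this unconditionally is safe on all Windows versions
    ::LoadLibraryW(L"winrtact.dll");
}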

Since I was using a CMake project system, I quickly wrote a CMake script (only for the x64 version of that library) and built it directly in the main project structure.
Since I don't want to mess with the Microsoft source licenses, the real sources are downloaded on the fly by the CMake configure script and copied into the source directory.
The winrtact.dll is then copied to the application directory with a post-build event.

We then use the library as a link target for the main executable and force a linker symbol on it.

I didn't find many users of this library on the web; it seems only winget uses it, but I'm not sure. It will quickly become irrelevant as these older versions of Windows get replaced by newer ones.

Wrapping up

Finding a solution for all the issues has taken a long time, but in the end I got a template project that I can use in my production code.

All of this may seem complex, but once you have a working example it can be quickly re-used, and the actual integration time is near zero.

Some steps could be further automated: for instance, the C++ -> C# projection projects could be auto-generated in the CMake configuration step using a configuration template.

Of course the C++ and C# projects can be built separately too, possibly even with different build systems. In this case I preferred a single solution, to quickly work on both languages.

I hope this can be a shortcut for developers doing the same thing and hitting the same roadblocks.

Please send me feedback if any of these instructions are wrong or if there's a better way to do some of these things.

As a next step, I will try to do a proper interop project in my production code and report here any issues found.

Monday, February 13, 2023

Qt sciter integration

Just for information, I recently published a small repository containing a Qt 6.x integration for the Sciter component.

Sciter is a nice cross platform "Embeddable HTML/CSS/JavaScript engine". It is not FOSS but has affordable licenses and commercial source-access.

Have fun!

Saturday, March 23, 2013

Building a git network visualization tool


I've always liked the Git network viewer (see this example, for instance). In the "early git days", I found it very useful to have visible feedback on the git operations I was performing.
Even today I find it clearer than the regular vertical log viewer.
Unfortunately it seems that only GitHub provides similar functionality, and of course it works only on remote repositories.

So, I decided to write a similar tool myself :) I picked up Qt 5.0 and libgit2 and built something from scratch in a few hours.
I thought it was a relatively simple operation, but in the end I spent a good amount of time fighting with QML performance. At the moment I've just reached a decent level of performance (at least for a first version), and after I complete the GUI with a basic set of options, I will publish it as OSS.

After the first version, I plan to add navigation functionality the GitHub viewer doesn't have, and also a "graphs" page. This kind of visualization isn't really nice in repositories where many branches are involved, but I think I can find a way to display these nicely too...

So, stay tuned :)



Monday, February 4, 2013

Fun with composition and interfaces

As code reuse is one of the most important programming concepts, let's see some nice ways to use C++ templates to make code reuse simple and efficient. Here follows a bunch of techniques that have proven useful in my real-life programming.

Suppose you have a set of unrelated classes, and you want to add a common behavior to each of these.

class A : public A_base
{

};

class B : public B_base
{

};
...
 
You want to add a set of common data and functions to each of these classes. These can be extension functions (like loggers, benchmarking classes, etc.), or algorithm traits (see below).

class printer
{
public:
    void print ()
    {
      std::cout << "Print " << std::endl;
    }
};
The simplest way to accomplish this is using multiple inheritance:
 
class A : public A_base , public printer
{

};

That's easy, but the printer class won't be able to access any of A's members. This way, it is possible to compose only classes which are independent from each other.

Composition by hierarchy insertion

Suppose we want to access a class member to perform an additional operation. A possible way is to insert the extension class into the hierarchy.

template <class T>
class printer : public T
{
public:
    void print ()
    {
        std::cout << "Class name " << this->name() << std::endl;
    }
};

The printer class now relies on a "std::string name()" function being available on its base class. This kind of requirement is quite common in template classes, and until we get concepts, we must pay attention that the required methods exist on the extended classes.
BTW, type traits could eventually be used in place of direct function calls.
The class can be inserted in a class hierarchy and the derived classes can access the additional functionality.
 
class A_base
{
public:
    std::string name () { return "A_base class"; }
};

class A : public printer<A_base>
{

} ;

int main ()
{
    A a;
    a.print();
    return 0;
}

This technique can be useful to compose algorithms that have different traits, and to share code between similar classes that don't have a common base.
This last example is not really a good one, for multiple reasons:
  • In case of a complex A_base constructor, its arguments should be forwarded by the printer class. C++11 inheriting constructors (using A_base::A_base) make this easier, but in C++98 you have to manually forward each A_base constructor, making the class less generic.
  • Inserting a class into a hierarchy can be unwanted.
  • If you want to access A (not A_base) members from the extension class, you need to add another derived class, deriving from printer<A>.
Anyway, this specific kind of pattern can still be useful to reuse virtual function implementations:
class my_interface
{
public:
   virtual void functionA () = 0;
   virtual void functionB () = 0;
};

template <class Base>
class some_impl : public Base
{
     void functionA () override;
};

class my_deriv1 : public some_impl<my_interface>
{
   void functionB() override;
};

In particular, if the Base parameter is an interface (as my_interface is here), some_impl can be used to reuse a piece of the implementation.

Using the Curiously recurring template pattern.

Now comes the nice part: to overcome the limitations of the previous samples, a nice pattern can be used: the Curiously Recurring Template Pattern (CRTP).

template <class T>
class printer
{
public:
   void print ()
   {
       std::cout << (static_cast<T*>(this))->name() << std::endl;
   }
};

class A : public A_base, public printer<A>
{
public:
   std::string name ()
   {
       return "A class";
   }
};

int main ()
{
     A a;
     a.print ();
     return 0;
}

Let's analyze this a bit: the printer class is still a template, but it doesn't derive from T anymore.
Instead, the printer implementation assumes that it will be used in a context where this is convertible to T*, and thus will have access to all T members.
This is the reason for the static_cast to T* in the code.

If you look at the code of the printer class alone, a question arises immediately:
how come a class unrelated to T can be statically cast to T*?

The answer is that you don't have to look at the printer class "alone": template classes and their member functions are instantiated on first use.
When the print() function is called, the template is instantiated. At that point the compiler already knows that A derives from printer<A>, so the static cast can be performed like any downcast.

As you can see, with this idiom you can extend any class and even access its members from the extending functions.
You may have noticed that the extension class can only access public members of the extended class. To work around this, the extension class must be made a friend:
template <class T>
class printer
{
    T* thisClass() { return static_cast<T*>(this); }

public:

   void print ()
   {
       std::cout << thisClass()->name() << std::endl;
   }
};

class A : public A_base, public printer<A>
{
  friend class printer<A>;

private:

   std::string name ()
   {
       return "A class";
   }
};

I've also added a thisClass() utility function to simplify the code and keep the cast in one place (the const version is left to the reader).

Algorithm composition

This specific kind of pattern can be used to create what I call “algorithm traits”, and it’s one of the ways I use it in real code.

Suppose you have a generic algorithm which is composed of two or more parts. Suppose also that the data is shared and possibly stored in another class (as usual, endless combinations are possible). Here I'll give a very simple example, but I've used this to successfully compose complex CAD algorithms:
template <class T>
class base1
{
public:   // public, so the sibling CRTP bases and algorithm<T> can reach these
    std::vector<int> data;
    void fillData ();
};

template <class T>
class phase1_A
{
public:
    void phase1();
};

template <class T>
class phase1_B
{
public:
    void phase1();
};

template <class T>
class phase2_A
{
public:
    void phase2();
};

template <class T>
class phase2_B
{
public:
    void phase2();
};

template <class T>
class algorithm
{
public:
    void run ()
    {
        // CRTP: reach the composed class, where all the pieces are visible
        T* self = static_cast<T*>(this);
        self->fillData();
        self->phase1();
        self->phase2();
    }
};

// this would be the version using the "derived class" technique
// class UserA : public algorithm<phase2_A<phase1_A<base1>>> {};

class comb1 : public algorithm<comb1>, public phase1_A<comb1>, public phase2_A<comb1>, public base1<comb1> {};
class comb2 : public algorithm<comb2>, public phase1_B<comb2>, public phase2_A<comb2>, public base1<comb2> {};
class comb3 : public algorithm<comb3>, public phase1_A<comb3>, public phase2_B<comb3>, public base1<comb3> {};
class comb4 : public algorithm<comb4>, public phase1_B<comb4>, public phase2_B<comb4>, public base1<comb4> {};
...
This technique is useful when the algorithms heavily manipulate member data, and functional-style programming (input -> copy -> output) would not be efficient. Anyway... it's just another way to combine things.
A small note on performance: the static_casts in the algorithm pieces will usually require a displacement operation on this (i.e. a subtraction), while using a hierarchy usually results in a NOP.
This technique can also be mixed with virtual functions: the algorithm can be implemented in a base class, while the function overrides are composed in the way I just showed.
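
To make the pattern concrete, here is a self-contained, compilable reduction of the example above, with trivial stand-in bodies for the phases:

#include <iostream>
#include <vector>

template <class T>
class base1
{
public:
    std::vector<int> data;
    void fillData() { data = { 1, 2, 3 }; }
};

template <class T>
class phase1_A
{
public:
    void phase1() { std::cout << "phase1_A" << std::endl; }
};

template <class T>
class phase2_B
{
public:
    void phase2()
    {
        // phases can freely access the shared data through the CRTP cast
        T* self = static_cast<T*>(this);
        std::cout << "phase2_B, data size " << self->data.size() << std::endl;
    }
};

template <class T>
class algorithm
{
public:
    void run()
    {
        T* self = static_cast<T*>(this);
        self->fillData();
        self->phase1();
        self->phase2();
    }
};

class comb3 : public algorithm<comb3>, public phase1_A<comb3>,
              public phase2_B<comb3>, public base1<comb3> {};

int main()
{
    comb3 c;
    c.run();
    return 0;
}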

Interfaces

As we saw, extension methods allow reusing specific code in unrelated extending classes. In high-level C++, the same thing is often accomplished with interfaces:
class IPrinter
{
public:
  virtual void print () = 0;
};

class A : public A_base , public IPrinter
{
public:
 void print () override { std::cout << "A class" << std::endl; }
};

In this case, each class that implements the interface has to re-implement the code in its own specific way. The (obvious) advantage of using interfaces is that instances can be used in a uniform way, independently of the implementing class.
void use_interface(IPrinter * i)
{
  i->print();
}

A a;
B b; // unrelated to a
use_interface(&a);
use_interface(&b);

This is not possible with the techniques of the previous sections, since even if the template class is the same, the instantiated classes are completely unrelated.
Of course one could make use_interface a template function too. Surely that can be a way to go, especially if you are writing code heavily based on templates. In this case, though, I would like to find a high-level way, and reduce template usage (and consequently code bloat and compilation times too).
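
For completeness, the template alternative just mentioned would look roughly like this:

// the template version: works with any class exposing print(), but each
// instantiation generates new code (the "code bloat" mentioned above)
template <class P>
void use_interface(P & printable)
{
    printable.print();
}
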
class IPrinter
{
public:
  virtual void print () = 0;
};

template <class T>
class Implementor : public IPrinter
{
    T* thisClass() { return static_cast<T*>(this); }

public:
  void print () override
  {
      std::cout << thisClass()->name() << std::endl;
  }
};

The Implementor class implements the IPrinter interface using the composition technique explained before, and expects the name() function to be present in the user class.
It can be used in this simple way:
class A : public A_Base, public Implementor<A>
{
    friend class Implementor<A>;

    std::string name () { return "A"; }
};

int main ()
{
    A a;
    IPrinter * intf = &a;
    intf->print();
    use_interface(intf);
    return 0;
}
Some notes apply:
  • This kind of pattern is useful when you have a common implementation of an interface that depends only in part on the combined class (A in this case).
  • Since A derives from Implementor<A>, it's also true that A implements IPrinter; up-casting from A to IPrinter is allowed.
  • Even if Implementor doesn't have data members, A's size is increased due to the presence of the IPrinter vtable.
  • Using interfaces allows reducing code bloat in the class consumers, since every Implementor-derived class can be used as an (IPrinter *).
    There's a small performance penalty though, caused by virtual function calls and the increased class size.
  • The benefit is that virtual dispatch is used only when calling the print function through an IPrinter pointer. If called directly, static binding is used instead. This can be true even for references, if the C++11 final modifier is added to the Implementor definition:
 void print () final override;
A a;
A & ref = a;
IPrinter * intf = &a;
a.print ();        // static binding
ref.print();       // dynamic binding (static if declared with final)
intf->print (); // dynamic binding;

This kind of composition doesn't have limits on the number of interfaces.
class mega_composited : public Base, public Implementor<mega_composited>, public OtherImplementor<mega_composited>, ...
{

};


Adapters

These implementors can be seen as a sort of adapter between your class and the implemented interfaces. This means that the adapter can also be an external class; in this case you will need to pass a pointer to the original class in the constructor.
template <class T>
class Implementor : public IPrinter
{
    T* thisClass;

public:
  Implementor (T * original) : thisClass(original)
  {
  }

  void print () override
  {
      std::cout << thisClass->name() << std::endl;
  }
};
Note that thisClass is now a member and is initialized in the constructor.
 ...
 A a;
 Implementor<A> impl(&a);
 use_interface(&impl);
As you can see, the Implementor is used as an adapter between A and IPrinter. This way, class A won't contain the additional member functions.
Note: memory management has been totally omitted from this article. Be careful with these pointers in real code!
One can also make the object convertible to the interface while keeping the implementor as an object member (a sort of COM aggregation).
class A
{
public:

   Implementor<A> printer_impl;
   /*explicit*/ operator IPrinter * () { return &printer_impl; }

   A () : printer_impl(this) {}
};
or even lazier...
class A
{
public:
     std::unique_ptr<Implementor<A>> printer_impl;

     /* explicit */ operator IPrinter * () {
       if (printer_impl.get() == nullptr)
          printer_impl.reset(new Implementor<A>(this));
       return printer_impl.get();
     }
};


I will stop here for now. C++ lets programmers compose things in many interesting ways, and obtain a high degree of code reuse without losing performance.
This kind of composition is near the original intent of templates, i.e. a generic piece of code that can be reused without having to resort to copy-and-paste! Nice :)

Friday, February 1, 2013

Write a C++ blog they said...

... it will be fun they said! :)
Indeed I have already written two more articles, but it's the editing part that is time consuming:
  • Read the article over and over and make sure the English is good enough.
  • Do proper source code formatting.
  • Check that the code actually compiles and works.
  • Make sure that the whole article makes sense, and so do its smaller parts.
All of this can double the time required to write the original article text.
In the meantime I'll try to do smaller updates on smaller topics.

So, next time we'll have some fun with interfaces and class composition! Stay tuned!

Sunday, January 6, 2013

Unicode and your application (5 of 5)


Other parts: Part 1 , Part 2 , Part 3 , Part 4

Here comes the last installment in this series: a quick discussion on output files, then a summary of the possible choices using the C++ language.

Output files

What we saw for input files applies to output files as well.
Whenever your application needs to save a text output file for the user, the rules of thumb are the following:

  • Allow the user to select an output encoding for the files he's going to save.
  • Consider a default encoding in the program's configuration options, and allow the user to change it during the output phase.
  • If your program loads and saves the same text file, save the original encoding and propose it as the default.
  • Warn the user if the conversion is lossy (e.g. when converting from UTF8 to ANSI).

Of course this doesn't apply to files stored in the internal application format; in that case it's up to you to choose the preferred encoding.

The steps of text-file saving mirror the steps of text-file loading:
  1. Handle the user interaction and encoding selection (with the appropriate warnings)
  2. Convert the buffer from your internal encoding to the output encoding
  3. Send/save the byte buffer to the output device
I won't bother with the pseudo-code for these specific steps; instead, I have updated the unicode samples on GitHub with "uniconv", a utility that is an evolution of the unicat one: a tool that lets you convert a text file to a different encoding.
  uniconv input_file output_file [--in_enc=xxxx] --out_enc=xxxx [--detect-bom] [--no-write-bom]
It basically lets you choose a different encoding for both input and output files.

To BOM or not to BOM?


When saving a UTF text file, it's really appropriate to save the BOM in the first bytes of the file. This will allow other programs to automatically detect the encoding of the file (see the previous part).
The Unicode standard discourages the usage of the BOM in UTF8 text files. Anyway, at least on Windows, a UTF8 BOM is the only way to automatically distinguish a UTF8 text file from an "ANSI" one.
So, the choice is up to you, depending on how and where the generated files will be used.
Personally, I tend to prefer the presence of a BOM, to leave all the ambiguities behind.
Edit: since the statements above seem to be a personal opinion, and I don't have enough arguments either for or against storing a UTF8 BOM, I'll let the reader find a good answer on his own! I promise I'll come back to this topic.
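
For reference, the UTF8 BOM is just three fixed bytes at the start of the file; a minimal sketch of emitting it (the helper name is mine):

#include <fstream>

// writes the UTF8 BOM (EF BB BF) at the start of an output file
void write_utf8_bom(std::ofstream & out)
{
    const unsigned char bom[] = { 0xEF, 0xBB, 0xBF };
    out.write(reinterpret_cast<const char*>(bom), sizeof(bom));
}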

You have to make a choice

As we have seen in the previous articles, there are a lot of differences between systems, encodings and compilers. Anyway, you still need to handle and manipulate strings. So, what's the most appropriate choice for character and string types?
Posing such a question in a forum or on StackOverflow could easily generate a flame war :) This is one of those decisions that depends on a wide variety of factors, and there's no definitive choice valid for everyone.

Choosing a string and encoding type doesn't necessarily mean using a Unicode encoding, and it doesn't mean that you have to use a single encoding for all the platforms you are porting your application to.

The important thing is that this single choice is propagated coherently across your program.

Basically you have to decide on three fundamental aspects:
  1. Whether to use standard C++ types and functions or an existing framework
  2. The character and string type you will be using
  3. The internal encoding for your strings
Choosing an existing framework will often force choices 2) and 3).
For instance, by choosing Qt you will automatically be forced to use QChar and UTF-16.
Choices 2) and 3) are strictly related, and can be inverted: choosing a data type will force the encoding you use, depending on the word size. Alternatively, one can choose a specific encoding, and the data type will be chosen as a consequence.

Depending on how portable your software needs to be, you can choose between:
  • a narrow character type (char) and the related std::string type
  • a wide character type (wchar_t) and the related std::wstring type
  • a new C++11 character type
  • a character type depending on the compilation system
Here follows a quick comparison of these choices.

Choosing a narrow character type

This means choosing the char/std::string pair.
Pros:
  • It's the most supported at the library level, and widely used in existing programs and libraries.
  • The data type can be adapted to a variety of encodings, both fixed- and variable-length (e.g. Latin1 or UTF8).
Cons:
  • On Windows you can only use fixed-length encodings; UTF8 is not supported unless you do explicit conversions.
  • On Linux the internal encoding can vary between systems, and UTF8 is just one of the choices. Indeed, you can have a fixed- or variable-length encoding depending on the locale.

Let me stress once again that choosing char on the Windows platform is a BAD choice, since your program will not support Unicode at all.

Choosing a wide character type

This means using wchar_t and std::wstring.

Pros:
  • Wide character types are actually more Unicode-friendly and less ambiguous than the char data type.
  • Windows works better with (is built on!) wide-character strings.
Cons:
  • The wchar_t type has different sizes between systems and compilers, and the encoding changes accordingly.
  • Library support is more limited; some functions are missing from some standard library implementations.
  • Wide characters are not really widespread, and existing libraries that chose the "char" data type can require character type conversions and cause trouble.
  • Unixes "work better" with the "char" data type.

Choosing a character type depending on the compilation system

This means that the character type is #defined at compilation time, and it varies between systems.
Pros:
  • The character type will adapt to the "preferred" one of the system. For instance, it can be wchar_t on Windows and char on Unixes.
  • You are sure that the character type is well supported by the library functions too.
Cons:
  • You have to think in a generic way and make sure that all the functions are available for both data types.
  • Not many libraries support a generic data type, and the usage is not widespread on Unixes (more so on Windows).
  • You will have to map all the functions you are using with a define.
Have you ever met the infamous TCHAR type on Windows? It is defined as:
#ifndef UNICODE
  #define TCHAR char
#else
  #define TCHAR wchar_t
#endif
The Visual C++ library also defines a set of generic "C" library functions that map to the narrow or wide version.
This technique can be used between different systems too, and indeed works well. The downside is that you have to mark all your literals with a macro that also maps to the char or wchar_t version.
#if !defined(WIN32) || !defined(UNICODE)
  #define tchar char
  #define tstring std::string
  #define _T(x) x        // narrow literals as-is (tchar.h provides _T on Windows)
#else
  #define tchar wchar_t
  #define tstring std::wstring
  #define _T(x) L##x     // wide literals
#endif

tstring t = _T("Hello world");

I have never seen this approach used outside Windows, but I have used it sometimes and it's a viable choice, especially if you don't do too much string manipulation.

Choosing the new C++11 types

Hey, nice idea. This means using char16_t and char32_t (and the corresponding u16string and u32string).
Pros:
  • You will have data types with fixed sizes across systems and compilers.
  • You only write and test your code once.
  • Library support will likely improve in this direction.
Cons:
  • As of today, library support is lacking.
  • Operating systems don't support these data types, so you will need to do conversions anyway to make function calls.
  • UTF8 strings still use the char data type and std::string, increasing ambiguity (see the previous chapters).


Making a choice, the opposite way

As we have seen in the part above, making the choice about the data type will lead you to different encodings in different runtimes and systems.
The opposite way to take the decision is to choose an encoding and then infer the data type from it. For instance, one can choose UTF8 and select the data type as a consequence.
This kind of choice goes "against" the common C/C++ design and usage, because, as we have seen, encodings and data types tend to vary between platforms.
Still, I can say it is probably a really good choice, and the choice that many "existing frameworks" took.
Pros:
  • You write and test code only once (!)
Cons:
  • Forget about the C/C++ libraries, unless you are willing to do many conversions depending on the system (thus losing all the advantages).
  • This kind of approach will likely require custom data types.
  • Libraries written using standard C++ will require many conversions between string types.
In this regard, let's consider three possible choices:
UTF8:
  • You can use the char data type and stay compatible. I suggest not doing so, to avoid ambiguity.
  • On Windows it is not supported, so you will need to convert to UTF16/wchar_t for your API calls (see the sketch after this list).
  • On Unixes that support UTF8 locales, it works out of the box.
UTF16:
  • If you have a C++11 compiler you can use a predefined data type; otherwise you will have to invent one of your own that is 16 bits on all systems.
  • On Windows, all the APIs work out of the box; on Linux you will have to convert to the current locale (hopefully UTF8).
UTF32:
  • Again, if you have C++11 you can use a predefined data type; otherwise you will have to invent one of your own that is 32 bits on all systems.
  • On any system you will have to do conversions to make system calls.
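
As an example of the conversion mentioned in the UTF8 bullet above, here is a minimal sketch of the UTF8 -> UTF16 step needed before calling a wide Windows API (the helper name is mine):

#include <windows.h>
#include <string>

// converts a UTF8 std::string to a UTF16 std::wstring suitable for "W" APIs
std::wstring utf8_to_wide(const std::string & utf8)
{
    if (utf8.empty())
        return std::wstring();
    int len = ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
    std::wstring wide(len, L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], len);
    return wide;
}
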
This approach, taken by frameworks such as Qt, .NET, etc., requires a good amount of code to be written. Not only do they choose an encoding for the strings, but they also contain a good number of supporting functions, streams, conversions, etc.

Finally choosing

All of this seems like a cul-de-sac :)
To summarize, one has either to choose a C++ way of doing things that is very variable between systems, or a "fixed encoding" way that forces the use of existing frameworks.

Fortunately, C++ genericity can abstract the differences, and hopefully standard libraries will improve Unicode support out of the box. I don't expect too much change from the existing OS APIs, though.
Still, I'm not able to give you a rule and "the correct" solution, simply because it doesn't exist.
Anyway, I hope to have pointed out the possible choices and scenarios that a programmer can face during the development of an application.

Now that you have read all of this, I hope you are asking yourself these questions:
  • Do my programs correctly handle compiler differences in encodings and function behavior?
  • Do my programs correctly handle variable-length encodings?
  • Do my programs do all the required conversions when passing strings around?
  • Do my programs correctly handle input and output encodings for text files?

Personal choices

Personally, if the application I'm writing does any text manipulation, I prefer to use an existing framework, such as Qt, and forget about the problem.
I really prefer the maintainability and the reduced test burden of this approach.
Anyway, if the strings are just copied and passed around, I usually stick with the standard C/C++ strings, like std::string or "tstring" when necessary. This way I keep the program portable, with a reduced set of dependencies.
Finally, when I write Windows-only programs I choose std::wstring, and then use the Windows APIs to do everything.

C++ standard library status: again

As we have previously seen, standard C++ classes are not really Unicode-friendly, and the implementation level varies very much (too much!) between systems. Anyway, if you are using a decent C++11 compiler you will find some utility classes that let you do many of these operations using only standard classes.
I will soon write an addendum on this blog, updating the two unicode examples to use C++11 instead of Qt, trying to write them in the most portable way possible.

Conclusions

I hope to get some feedback about all of this: I'm still learning it, and I think that shared knowledge always leads to better decisions. So, feel free to write me down an email or a comment below!

Monday, December 10, 2012

Unicode and your application : part 4 example

I just wrote a simple example with Qt on how to read (and print out) a text file. As I suggested in part 4, the program lets the user choose the default encoding and eventually auto-detect it.
In this github repository I added the unicat application, which is a console application that can print a single text file.
You can download the repository as ZIP file here: https://github.com/QbProg/unicode-samples/archive/master.zip

Usage

unicat [--enc=<encoding_name>] [--detect-bom] <filename>

  • If the application is launched without a filename, it prints a list of supported encodings.
  • If --enc is not passed, it uses the system default encoding.
  • If --enc is passed, it uses the selected encoding.
  • If --detect-bom is passed, the program tries to detect the encoding from the BOM. If a BOM is not found, it uses the passed (or default) encoding.
The file is then printed to stdout.

Note: on Windows, the console must have a Unicode font set, like Lucida Console, or you won't see anything displayed. Also, even with a console font, you won't see all the characters.

The repository contains some text files encoded with various encodings. Feel free to play with these files and with the program options.

Building

To build the unicat sample, open the .pro file with QtCreator or run qmake from the sources dir. To build, use make or nmake, or your preferred build command.

The code

The code is pretty simple. It first uses Qt functions to parse the program options.
It then reads the text file: Qt uses QTextCodec to abstract the encoding and decoding of a binary stream to a specific string encoding.

The function QTextCodec::availableCodecs() is used to enumerate all the possible encodings supported on the system.
Q_FOREACH(const QByteArray & B , QTextCodec::availableCodecs())
   {
       QString name(B);
       std::cout << name.toStdString() << std::endl;
   }

If the user passes an encoding, the program tries to load the specific QTextCodec for it; otherwise it uses the default QTextCodec, which is usually set to "System":
if (encoding.isEmpty())
    userCodec = QTextCodec::codecForLocale();
else
    userCodec = QTextCodec::codecForName(encoding.toAscii());

After a file is open, the program uses a QTextStream to read it:
QTextStream stream(&file);
stream.setAutoDetectUnicode(detectBOM);
stream.setCodec(userCodec);
This is the initialization part, which specifies whether a BOM should be auto-detected and the default encoding to be used.
Reading the text and displaying it is just a matter of readLine() and wcout.
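
Roughly, the loop looks like this (a sketch; see the repository for the actual code):

// read each line through the codec-aware stream and print it as a wide string
while (!stream.atEnd())
{
    QString line = stream.readLine();
    std::wcout << line.toStdWString() << std::endl;
}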

"Why are you using wcout?"

If you recall the previous parts, the default std::string encoding is UTF8 on Linux, but ANSI (i.e. the active 8-bit code page) on Windows. You can't do anything about this, since the CRT works that way.
In addition, std::string filenames won't work on Windows for the same reason. So, in this example, the most portable way to print unicode strings is by using wcout + QString::toStdWString().

There are also other problems with the Windows console: by default, Visual C++ streams and the Windows console don't support UTF16 (or UTF8) output. To make it work you have to use this hack:

#include <fcntl.h>   // _O_U16TEXT
#include <io.h>      // _setmode, _fileno

_setmode(_fileno(stdout), _O_U16TEXT);

This allows printing unicode strings to the console. Keep in mind the considerations made in the previous chapter, since not all fonts will display all the characters correctly.
BTW, I didn't find any problems on Linux.

Other frameworks

It would be nice to get an equivalent of this program (portable in the same way) using other frameworks or using only standard C++, to compare the complexity of each approach. Feedback is appreciated :)