Thursday, November 4, 2010

Getting Boost.Regex + Unicode to work with Microsoft VC++ 2010

[ This article pertains to Boost version 1.44. It may or may not remain useful when newer Boost versions appear, depending on how the Boost.Regex developers arrange the build process. ]

Although the subject above may seem overly specific, I think many people want to use regular expressions with international text support in programs developed in a Windows/Microsoft environment. I’m in no way a Boost.Regex expert or even a seasoned Boost user. I would just like to share my “newbie” experience with programmers who are trying to achieve the same thing I did. With the current Boost version, 1.44, it is not so easy.
One thing should be mentioned before we start. I adhere to the principle of least resistance. For linking software libraries that means: use vendor-supplied compiled binaries as much as you can, and do as little by hand as you can. The moment will probably come when you need to build a library from sources (which can be quite troublesome) or do some manual tweaking. But that is the future, so leave those hassles for the future. For now we want to get our libraries up and running, and fast. At least on the Windows platform, this approach seems sound.

OK, let’s start. As noted here, there are two approaches to supporting Unicode: the simple one and the good one. The simple approach can suffice in some cases, but it is not portable and has several disadvantages. To help you build a truly Unicode-aware application, Boost.Regex (one of the best regular expression libraries) requires ICU (one of the best Unicode libraries). The following describes how to get Boost.Regex and ICU to work together.

First we need to obtain the latest release of ICU from here. On the download page choose ICU4C, since we need to interface the ICU library with C/C++ programs. Download the Windows binaries .zip file and unpack it to some folder on your hard drive, say C:\Distr\icu.

What we need to do to build Boost.Regex with ICU is described in detail here. However, building from scratch is still not the preferred option, and while looking for an alternative I came across the BoostPro Installer, which can provide binaries for every possible Boost library, compiler, threading model, etc. But… BoostPro does not cover per-library specifics such as ICU-enabled Regex.

Therefore we have no choice but to compile Boost.Regex by hand. The BoostPro Installer can still be useful, though, as a convenient download tool that provides the Boost sources and, more importantly, some compiled Boost tools (you will see below why we need them). Run BoostPro and proceed until you reach the wizard page where compilers and variants can be selected. Select no checkboxes there at all. On the next page leave only the four checkboxes at the top of the list checked. Then advance to the end of the wizard and wait for the files to be downloaded and installed.

Assume you chose to install Boost to the default installation folder (C:\Program Files\boost\boost_1_44). In the C:\Program Files\boost\boost_1_44\libs\regex\build folder you will find several vc*.mak files which (what a pleasant surprise!) outnumber those mentioned in the documentation on the Boost web site. The less pleasant surprise is that there is none for MSVC10. Running vc9.mak with VC10’s nmake.exe produces many errors and unknown-option warnings, which look a bit scary. Editing makefiles manually is not an option, as we need the result as soon as possible.

Somehow the guys at BoostPro managed to build the Regex library with MSVC10, and they are unlikely to conceal their secret knowledge from the public, so it has to be somewhere around. Boost.Regex’s build folder also contains some .sh files (Unix shell scripts), a .cpp file and a cryptic Jamfile.v2. Putting this name together with the information in Regex’s building guide, we can conclude that we need a program called bjam (in fact a part of Boost’s own build system) to run Jamfile.v2. Fortunately, the BoostPro installation leaves a compiled bjam.exe in the C:\Program Files\boost\boost_1_44\bin folder.

Now let’s launch a command line window and change the working directory to Regex’s build folder. The straightforward command (partially borrowed from the Regex installation guide and Google search results)
"C:\Program Files\boost\boost_1_44\bin\bjam" -sICU_PATH="C:\Distr\icu" toolset=msvc-10.0 release threading=multi link=shared

outputs, among other information, an undesired message

has_icu builds : no,

which indicates that the test program has_icu_test.exe could not be compiled or linked correctly, so the build script falls back to building Regex without ICU. After struggling with this strange problem for a while, I found that for some reason the build tool requests the debug versions of the ICU libraries, which of course I don’t have and didn’t intend to use. I must confess I’m not fond of studying new software tools that I’m sure I won’t use much, but here I had no other choice. Eventually I found that the ‘lib’ rule sets select the wrong alternatives even when configuration features are explicitly provided via command-line arguments, as in the command above. E.g. this block in the Jamfile.v2 file

   lib icuuc : : <search>$(ICU_PATH)/lib <link>shared <runtime-link>shared ;
   lib icuuc : : <toolset>msvc <variant>debug <name>icuucd <search>$(ICU_PATH)/lib <link>shared <runtime-link>shared ;
   lib icuuc : : <name>this_is_an_invalid_library_name ;

chooses the second line, i.e. the icuucd library (the debug version of one of ICU’s libraries), although it is supposed to choose the first line, i.e. the icuuc library (the release version of that library). Other rule sets behave in a similar way. I had no time to investigate why this happens, and I see no reason not to follow the path of least resistance and use a no-brainer: in the Jamfile.v2 file I commented out all alternatives that may lead to the selection of incorrect libraries, like this:

   lib icuuc : : <search>$(ICU_PATH)/lib <link>shared <runtime-link>shared ;
#   lib icuuc : : <toolset>msvc <variant>debug <name>icuucd <search>$(ICU_PATH)/lib <link>shared <runtime-link>shared ;
#   lib icuuc : : <name>this_is_an_invalid_library_name ;
#   lib icudt : : <search>$(ICU_PATH)/lib <name>icudata <link>shared <runtime-link>shared ;
   lib icudt : : <search>$(ICU_PATH)/lib <name>icudt <toolset>msvc <link>shared <runtime-link>shared ;
#   lib icudt : : <name>this_is_an_invalid_library_name ;
#   lib icuin : : <search>$(ICU_PATH)/lib <name>icui18n <link>shared <runtime-link>shared ;
#   lib icuin : : <toolset>msvc <variant>debug <name>icuind <search>$(ICU_PATH)/lib <link>shared <runtime-link>shared ;
   lib icuin : : <toolset>msvc <variant>release <name>icuin <search>$(ICU_PATH)/lib <link>shared <runtime-link>shared ;
#   lib icuin : : <name>this_is_an_invalid_library_name ;
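If you would rather not edit the file by hand, the same change can be sketched as a small Python helper. This is only an illustration, not part of Boost: the function name comment_out_alternatives and the list of markers are my own assumptions, derived from the lines commented out above.

```python
import re

# Markers of the alternatives commented out in the article. This list is an
# assumption derived from the edited Jamfile.v2 shown above, not from Boost.
UNWANTED = (
    "<variant>debug",
    "this_is_an_invalid_library_name",
    "<name>icudata",
    "<name>icui18n",
)

# Matches an active (not yet commented out) "lib icuXX : ..." rule line.
LIB_RULE = re.compile(r"^\s*lib\s+icu\w+\s*:")


def comment_out_alternatives(jamfile_text):
    """Return jamfile_text with the unwanted ICU 'lib' alternatives prefixed by '#'."""
    result = []
    for line in jamfile_text.splitlines():
        if LIB_RULE.match(line) and any(marker in line for marker in UNWANTED):
            line = "#" + line
        result.append(line)
    return "\n".join(result)
```

To use it, read Jamfile.v2 into a string, pass it through the function and write the result back; keeping a backup copy of the original file first seems prudent. Lines that are already commented out are left alone, so running it twice does no harm.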

That is mostly it. Save the changes to the Jamfile.v2 file and run the above command again. This time it should work. My bjam configuration was release threading=multi link=shared, which means: I need the release (non-debug) version, multi-threaded, built as a DLL. If you want another configuration, change release to debug, multi to single or shared to static. Note, however, that the icuin line kept above is restricted to <variant>release, so a debug build may need further changes to the Jamfile.

When everything goes right, near the beginning of bjam’s console output you should see the message has_icu builds : yes.
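If the build is run from a script, the check for this message can be automated. The following is a minimal sketch; the function name has_icu_enabled is hypothetical, and it assumes bjam prints the status line exactly in the "has_icu builds : yes/no" form quoted in this article.

```python
import re


def has_icu_enabled(bjam_output):
    """Return True or False for 'has_icu builds : yes/no', or None if the line is absent."""
    match = re.search(r"has_icu builds\s*:\s*(yes|no)", bjam_output)
    if match is None:
        return None
    return match.group(1) == "yes"
```

The text passed in would be bjam’s captured console output, for example redirected to a file and read back. A result of None means the status line was not found at all, which is worth treating as a failure too.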

bjam places all generated files into the C:\Program Files\boost\boost_1_44\bin.v2\libs\regex\build\msvc-10.0\release\threading-multi folder (the last two subfolders vary depending on your options).

To double-check, you can examine a couple of .rsp files that also reside in the output folder. E.g. the file c_regex_traits.obj.rsp should contain the line

   -DBOOST_HAS_ICU=1

And the file boost_regex-vc100-mt-1_44.dll.rsp should contain the lines

   "icuuc.lib"
   "icudt.lib"
   "icuin.lib"
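The same double check can be sketched in a few lines of Python. The helper names missing_entries and check_output_folder are hypothetical; the file names and expected entries are the ones listed above for the release, threading=multi, link=shared build of Boost 1.44 and will differ for other configurations.

```python
import os

# Expected entries per .rsp file, taken from the article (Boost 1.44,
# release / threading=multi / link=shared); adjust for other configurations.
EXPECTED = {
    "c_regex_traits.obj.rsp": ["-DBOOST_HAS_ICU=1"],
    "boost_regex-vc100-mt-1_44.dll.rsp": ['"icuuc.lib"', '"icudt.lib"', '"icuin.lib"'],
}


def missing_entries(rsp_text, expected):
    """Return the expected entries that do not appear anywhere in rsp_text."""
    return [entry for entry in expected if entry not in rsp_text]


def check_output_folder(folder):
    """Map each .rsp file name to the list of expected entries it lacks."""
    report = {}
    for name, expected in EXPECTED.items():
        with open(os.path.join(folder, name)) as handle:
            report[name] = missing_entries(handle.read(), expected)
    return report
```

Calling check_output_folder with the output folder mentioned above should return an empty list for each file when ICU support was compiled in; any entry left in a list points at the missing define or library.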

That’s it. For how to use ICU-powered Boost.Regex, please refer here.

1 comment:

Peter Rudnik said...

At least some of it is still true for boost 1.53.0

you got to make changes to the regex jamfile.

Peter Rudnik
Berlin