Sunday, February 6, 2011

Tesseract OCR: Setting Up Interactive Debug Environment On Windows

The following are the step-by-step instructions for setting up and running Tesseract’s internal state viewer (called "ScrollView") on Windows.

Although a dedicated wiki article already exists (and the instructions herein are based upon it), it can be confusing for Tesseract newcomers and for those who are not comfortable with the mix of technologies the setup requires.
  1. First off, make sure you have the Java Runtime Environment (or simply “Java”) installed. If you don’t, go to http://www.java.com/en/download/manual.jsp and download it. Most likely, the offline version for Windows will suit you well. After the download completes, run the downloaded executable, follow the wizard steps and wait until the installation is finished.
  2. Tesseract’s viewer requires a few JAR files which haven’t changed for years and are a bit of a hassle to get. So I decided to pack them all into a single archived installation suite along with the Tesseract 3.01 executable and the other minimal infrastructure required. You can grab it here: http://www.4shared.com/get/Z4gnbJdP/tess_debug.html
  3. Then create a folder, say C:\tess_debug, and extract all the files from the downloaded installation suite into it, preserving the folder structure.
  4. Launch the Windows Command Prompt and change the current directory to your folder by running the command
    cd C:\tess_debug
  5. Now you are ready to launch the Tesseract debug environment. My installation suite contains the test file phototest.tif, so the command to display segmentation data for it is
    tesseract phototest.tif test1 segdemo inter
    Type the above command in the Windows Command Prompt. The viewer window containing letter outlines should appear shortly.
    A few words on the command-line parameters used:
    • test1 is the base name of the text file (test1.txt) that Tesseract creates as its output. It will contain the recognized text.
    • segdemo and inter are config files required to run Tesseract in this kind of debug mode (segmentation debugging); you can see them in the installation suite’s folder.
    • To run segmentation debugging with your own file, specify its name instead of phototest.tif. If your file is located outside the installation suite’s folder, you’ll need to prefix the filename with the path.
    • The above command runs recognition using the default language file eng.traineddata. To use your own language file, specify it with the -l command-line argument, e.g.
      tesseract image.tif test1 -l yourlang segdemo inter
      For this command to run successfully, the language file called yourlang.traineddata must be placed in the tessdata subfolder of the installation suite folder.
  6. The previous step describes how to debug segmentation. Nearly the same technique is used to debug the classifier. To change the debugging mode, replace segdemo with matdemo in the command line, like this:
    tesseract phototest.tif test1 matdemo inter
    NOTE: The matdemo config file can also be found in the installation suite folder.
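If you switch between the two debug modes often, the commands from steps 5 and 6 can be assembled by a small script. The following is only a rough Python sketch, not part of the installation suite: the function names build_debug_command and language_file_path and the mode labels "segmentation" and "classifier" are made up for illustration. It uses only the arguments shown above (the image name, the output base name, the optional -l language, and the segdemo, matdemo and inter config files).

```python
import os

# Config file that selects each debug mode, as described in steps 5 and 6.
# The mode labels on the left are hypothetical names chosen for this sketch.
DEBUG_CONFIGS = {
    "segmentation": "segdemo",
    "classifier": "matdemo",
}


def build_debug_command(image, output_base="test1", mode="segmentation", lang=None):
    """Return the tesseract command line (as a list) for the chosen debug mode."""
    if mode not in DEBUG_CONFIGS:
        raise ValueError("unknown debug mode: %r" % (mode,))
    cmd = ["tesseract", image, output_base]
    if lang is not None:
        # -l selects a language file other than the default eng.traineddata.
        cmd += ["-l", lang]
    # The mode-specific config comes first, followed by inter.
    cmd += [DEBUG_CONFIGS[mode], "inter"]
    return cmd


def language_file_path(suite_dir, lang):
    """Return where <lang>.traineddata is expected inside the suite folder."""
    return os.path.join(suite_dir, "tessdata", lang + ".traineddata")


# Print the two commands used in this post.
print(" ".join(build_debug_command("phototest.tif")))
# prints: tesseract phototest.tif test1 segdemo inter
print(" ".join(build_debug_command("phototest.tif", mode="classifier")))
# prints: tesseract phototest.tif test1 matdemo inter
```

To actually launch the viewer, the returned list could be passed to subprocess.call with cwd set to your installation folder (C:\tess_debug in this post), and os.path.isfile(language_file_path(...)) could be used to check that a custom language file is in place before running.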
That is all there is to installing and launching the Tesseract viewer. For information on how to use the viewer’s user interface, please refer to http://code.google.com/p/tesseract-ocr/wiki/ViewerDebugging