An Extensible Optical Character Recognition Framework


An Overview of Modules

This documentation discusses the fundamental concept of an OCR Module, abbreviated simply as Module. It also discusses how Conjecture combines a collection of Components into a Module, so an understanding of components is useful when reading this page (unfortunately, an understanding of modules is also useful when reading about components :-).

Conjecture provides support for analyzing, assessing, comparing, and modifying an unlimited number of OCRs. This is accomplished by defining an OCR Module hierarchy. Each class in this hierarchy represents a fully-functional OCR capable of taking an image as input and producing text as output. Each class in this hierarchy must deal with input image processing (noise removal, orientation adjustment, etc.), segmentation into glyphs (and words and lines and columns), identification of the Unicode character for each glyph, the production of properly formatted output, and everything else that goes into an OCR. However, these classes almost never do all of this work directly. Instead, they usually delegate responsibility for each individual issue to an Implementation of the Component representing that issue.

To clarify this, suppose we want to create a new module. The name of this module is important for two reasons: it usually establishes the name of the class that will implement it, and it is the value passed to the command-line -A flag to indicate that this OCR module should be used to perform image-to-text translation. For our example module, we will choose the name 'test' (the value to provide to the -A flag), and use the class TestModule.

Now, it is possible to implement the entire image-to-text functionality using the TestModule class itself (along with any number of helper classes, of course). However, implementing all of the code for the OCR in this fashion makes it difficult to experiment with alternative strategies for the various sub-components making up an OCR. The implementation becomes rigid and inflexible.

A much more flexible approach is to identify the various subcomponents, formalize them into Component class hierarchies (in which the root class provides an interface and subclasses represent alternative implementations), and have TestModule delegate responsibility for performing the various actions necessary to appropriate component implementation instances. This is the strategy chosen for the Conjecture framework.

Classes in the OCR Module hierarchy are usually very simple. So simple, in fact, that they can be automatically generated by the Conjecture infrastructure based on an entry in a special input file (see the discussion on Conjecture.modules).

The OCR Module hierarchy has an abstract class called, not surprisingly, OCRModule, as its root. This class defines a special interface consisting of a collection of Factory Methods (methods responsible for creating objects). One Factory Method exists for each Component that Conjecture has identified and formalized. The OCRModule class also defines one field for each component. During creation, OCRModule uses the factory methods (redefined in subclasses to provide subclass-specific component implementation instances) to obtain the instances with which these fields are initialized.

Subclasses of the abstract OCRModule class represent concrete, fully-functional OCR implementations. They usually consist solely of definitions of the factory methods required by the OCRModule superclass. If the OCR in question does not need a particular Component, the associated factory method simply returns NULL. Each factory method specifies which Component subclass to create, and thus establishes which implementation of each Component is used by this particular OCR Module.
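
The factory-method arrangement described above can be sketched roughly as follows. This is only an illustration, not Conjecture's actual source: the component interfaces are reduced to a single name() method, the createSegment/createIdentify method names, the init() step and the SegmentOnlyModule class are all hypothetical, and only two of the components are shown. The explicit init() call is used because C++ does not dispatch virtual calls to subclasses from inside a base-class constructor; how Conjecture itself handles this is not covered here.

```cpp
#include <memory>
#include <string>

// Hypothetical, heavily reduced stand-ins for two Component hierarchies.
struct SegmentComponent {
    virtual ~SegmentComponent() = default;
    virtual std::string name() const = 0;
};
struct IdentifyComponent {
    virtual ~IdentifyComponent() = default;
    virtual std::string name() const = 0;
};

struct GocrSegment : SegmentComponent {
    std::string name() const override { return "GocrSegment"; }
};
struct TestIdentify : IdentifyComponent {
    std::string name() const override { return "TestIdentify"; }
};

// Abstract root: one factory method and one field per component.
class OCRModule {
public:
    virtual ~OCRModule() = default;

    // Assumed two-phase creation: fills each field from its factory method.
    void init() {
        segment_ = createSegment();
        identify_ = createIdentify();
    }

    SegmentComponent* segment() const { return segment_.get(); }
    IdentifyComponent* identify() const { return identify_.get(); }

protected:
    virtual std::unique_ptr<SegmentComponent> createSegment() = 0;
    virtual std::unique_ptr<IdentifyComponent> createIdentify() = 0;

private:
    std::unique_ptr<SegmentComponent> segment_;
    std::unique_ptr<IdentifyComponent> identify_;
};

// A concrete module consists solely of factory method definitions.
class TestModule : public OCRModule {
protected:
    std::unique_ptr<SegmentComponent> createSegment() override {
        return std::unique_ptr<SegmentComponent>(new GocrSegment);
    }
    std::unique_ptr<IdentifyComponent> createIdentify() override {
        return std::unique_ptr<IdentifyComponent>(new TestIdentify);
    }
};

// Hypothetical module that does not need the Identify component:
// its factory method simply returns a null pointer.
class SegmentOnlyModule : public OCRModule {
protected:
    std::unique_ptr<SegmentComponent> createSegment() override {
        return std::unique_ptr<SegmentComponent>(new GocrSegment);
    }
    std::unique_ptr<IdentifyComponent> createIdentify() override {
        return nullptr;
    }
};
```

Swapping in a different implementation of a component then means changing a single factory method; nothing else in the module has to know.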

The Conjecture framework (specifically, the Env class) maintains a special module field that is initialized to an instance of a subclass of OCRModule based on the -A command-line flag. This field effectively acts as a container for a collection of Component subclass instances. Anywhere in the framework, code that needs to perform segmentation can ask for the module field of the (usually unique) Env instance. The resulting object has methods for obtaining the component implementations associated with every component, including a segment() method that returns an instance of a subclass of SegmentComponent. The execute(Element* page) method can then be invoked on this object to perform segmentation, all without having to know anything about the actual implementation details. This design allows us to completely separate individual components from one another, making for a flexible plug-and-play paradigm.
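
From the caller's side, the lookup described above might look like the sketch below. Everything here is a simplified assumption for illustration: the real Env, OCRModule and Element classes are much richer, the Env constructor taking a module name stands in for the command-line handling, and segmentPage is a hypothetical helper rather than a Conjecture function. Only segment() and execute(Element* page) are taken from the description above.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a node in the Element hierarchy.
struct Element {
    std::vector<std::string> children;
};

class SegmentComponent {
public:
    virtual ~SegmentComponent() = default;
    virtual void execute(Element* page) = 0;
};

// Toy implementation: pretends to find exactly one line on the page.
class GocrSegment : public SegmentComponent {
public:
    void execute(Element* page) override { page->children.push_back("line"); }
};

// Reduced module: a container for component implementation instances.
class OCRModule {
public:
    explicit OCRModule(std::unique_ptr<SegmentComponent> seg)
        : segment_(std::move(seg)) {}
    SegmentComponent* segment() const { return segment_.get(); }

private:
    std::unique_ptr<SegmentComponent> segment_;
};

// Reduced Env: selects the module from a name (assumed to come from the
// command line) and leaves the field empty for unknown names.
class Env {
public:
    explicit Env(const std::string& moduleName) {
        if (moduleName == "gocr") {
            module_.reset(new OCRModule(
                std::unique_ptr<SegmentComponent>(new GocrSegment)));
        }
    }
    OCRModule* module() const { return module_.get(); }

private:
    std::unique_ptr<OCRModule> module_;
};

// Caller code: knows nothing about which SegmentComponent subclass runs.
bool segmentPage(const Env& env, Element* page) {
    OCRModule* module = env.module();
    if (module == nullptr || module->segment() == nullptr) {
        return false;  // no module selected, or module has no Segment component
    }
    module->segment()->execute(page);
    return true;
}
```

The null checks reflect the fact that a module's factory method may return NULL for a component it does not need.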

The config/Conjecture.modules File

The config/Conjecture.modules file provides a description of every OCR Module currently supported by Conjecture. To add a new OCR Module, a user simply edits this file and adds another record in the same format as those already present. This file is expected to grow as more and more Modules are implemented, so we do not present a comprehensive summary here. Instead, we'll provide a small sample file and discuss the syntax for it:

  MODULE default CLASS DefaultModule
            COMPONENT Process  IMPLEMENTATION StandardProcess;
     extern COMPONENT Segment  IMPLEMENTATION GocrSegment;
     extern COMPONENT Identify IMPLEMENTATION GocrIdentify;
     extern COMPONENT Format   IMPLEMENTATION GocrFormat;

  MODULE gocr CLASS GocrModule
     extern COMPONENT Process  IMPLEMENTATION StandardProcess;
            COMPONENT Segment  IMPLEMENTATION GocrSegment;
            COMPONENT Identify IMPLEMENTATION GocrIdentify;
            COMPONENT Format   IMPLEMENTATION GocrFormat;

  MODULE ocrad CLASS OcradModule
            COMPONENT Process  IMPLEMENTATION OcradProcess;
     no     COMPONENT Segment;
     no     COMPONENT Identify;
     no     COMPONENT Format;

  MODULE wmh CLASS WmhModule
     extern COMPONENT Process  IMPLEMENTATION StandardProcess;
            COMPONENT Segment  IMPLEMENTATION WmhSegment;
            COMPONENT Identify IMPLEMENTATION WmhIdentify;
            COMPONENT Format   IMPLEMENTATION WmhFormat;

There are four Modules defined in this sample file, with the names 'default', 'gocr', 'ocrad' and 'wmh'. Note that the public name appears immediately after the 'MODULE' keyword, and is not necessarily the same as the class name (which appears after the 'CLASS' keyword). When invoking conjecture at the command line, the -A flag accepts a module name as a value.

Each OCR Module is fully described by specifying which class it will use to implement each Component. For example, GocrModule uses the StandardProcess subclass of ProcessComponent to implement the 'process' issue, and uses Gocr-specific subclasses of SegmentComponent, IdentifyComponent and FormatComponent to deal with segmentation, identification and formatting respectively.
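
To make the record layout concrete, here is a small sketch of how the sample syntax above could be read into data structures. It is not how Conjecture itself reads the file; the ModuleSpec and ComponentSpec types and the parseModules function are hypothetical, the parser only understands the keywords that appear in the sample (MODULE, CLASS, COMPONENT, IMPLEMENTATION and the 'extern' and 'no' qualifiers), and it does no error reporting.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One "COMPONENT ..." line of a record.
struct ComponentSpec {
    std::string qualifier;       // "", "extern" or "no", as written in the file
    std::string component;       // e.g. "Segment"
    std::string implementation;  // empty when no IMPLEMENTATION is given
};

// One "MODULE ... CLASS ..." record.
struct ModuleSpec {
    std::string name;       // public name, e.g. "gocr"
    std::string className;  // e.g. "GocrModule"
    std::vector<ComponentSpec> components;
};

// Drops the ';' that terminates each component line.
static std::string stripSemicolon(std::string s) {
    if (!s.empty() && s.back() == ';') s.pop_back();
    return s;
}

// Whitespace-driven parse: line breaks and indentation are not significant.
std::vector<ModuleSpec> parseModules(const std::string& text) {
    std::vector<ModuleSpec> modules;
    std::istringstream in(text);
    std::string tok;
    std::string pendingQualifier;
    while (in >> tok) {
        if (tok == "MODULE") {
            ModuleSpec m;
            in >> m.name;
            modules.push_back(m);
        } else if (tok == "CLASS" && !modules.empty()) {
            in >> modules.back().className;
        } else if (tok == "extern" || tok == "no") {
            pendingQualifier = tok;  // applies to the next COMPONENT keyword
        } else if (tok == "COMPONENT" && !modules.empty()) {
            ComponentSpec c;
            c.qualifier = pendingQualifier;
            pendingQualifier.clear();
            std::string word;
            in >> word;
            c.component = stripSemicolon(word);
            modules.back().components.push_back(c);
        } else if (tok == "IMPLEMENTATION" && !modules.empty() &&
                   !modules.back().components.empty()) {
            std::string word;
            in >> word;
            modules.back().components.back().implementation =
                stripSemicolon(word);
        }
        // Any other token is ignored in this sketch.
    }
    return modules;
}
```

Feeding the 'ocrad' record from the sample through parseModules would yield one ModuleSpec with four components, three of which carry the 'no' qualifier and an empty implementation.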

The bin/ocrgen Script

The ocrgen script reads the Conjecture.modules file and verifies the existence of an appropriate subdirectory of src/modules, and source/header files for the Module and Component classes provided by the Module.

... more here ...

Creating a new OCR Module

NOTE: This is a very brief description of a really important topic that needs to be much more fully addressed. Please contribute documentation if you have ideas for how to explain the process better.

Before making a module of your own, it is useful to understand how to run existing modules on images, so you should read the Usage documentation before proceeding with this discussion. This discussion assumes that Conjecture is installed and is working correctly. To determine this, try:

    prompt% cd $CONJECTURE
    prompt% make verify

If the above does not produce a table consisting of a whole bunch of plus signs ('+'), something is wrong. Inform us on the developers mailing list.

Here is the way you should create your first OCR Module. It differs somewhat from how you would create subsequent modules (once you become more familiar with the Conjecture architecture), but provides a quick means of getting something up and running immediately.

  1. Establish a name for your module. We'll use the name test, but for your module, I'd suggest your initials. For example, my first OCR module is wmh.
  2. Edit $CONJECTURE/config/Conjecture.modules (see the discussion) by adding a new record that looks like this (or uncommenting the record already present in the file for exactly this purpose):
       MODULE test CLASS TestModule EXTENDS GocrModule
         extern COMPONENT Process  IMPLEMENTATION StandardProcess;
         new    COMPONENT Segment  IMPLEMENTATION TestSegment     EXTENDS GocrSegment;
         new    COMPONENT Identify IMPLEMENTATION TestIdentify    EXTENDS GocrIdentify;
         new    COMPONENT Format   IMPLEMENTATION TestFormat      EXTENDS GocrFormat;

    The above is informing Conjecture that you want an OCR module named test, with module class TestModule. Furthermore, it says that you want to use the (pre-existing) StandardProcess implementation for the Process component (which means you do not need to implement it at all). It also says that this new module will be providing its own implementations for the Segment, Identify and Format components. However, it also indicates that each of these new implementations will be based on the corresponding GOCR implementation (that is, that the classes representing the implementations will inherit from the gocr implementation classes). This last fact is important, because it allows us to create a new Module that will be able to perform image-to-text identification immediately, after which you can start incrementally replacing or augmenting gocr functionality as you desire.

    Note that Conjecture currently has the most support for interacting with gocr, but support for other third-party OCRs will be added when we (including you!) do it. If you are remotely familiar with the architecture of any existing open-source OCR, please let us know on the developers mailing list.

  3. Once you've edited the Conjecture.modules file, do the following:
       prompt% cd $CONJECTURE/src/modules
       prompt% ls

    The src/modules directory is where all the module-specific code (primarily, implementations of components) resides. There is a subdirectory for each OCR module. Note that your new module, test, is not yet there. The abstract subdir is special (it isn't a module, but instead contains the abstract superclasses of the Module and Component hierarchies). The SomeName files are also special: they are templates for any new classes you might want to create.

  4. Now do the following:
       prompt% make modules
       prompt% ls
       prompt% ls test

    This causes Conjecture to invoke the special ocrgen script, which is responsible for parsing the $CONJECTURE/config/Conjecture.modules file (the one you just edited) and ensuring that C++ classes exist corresponding to the information present in that file. Most of the classes specified by the file already exist, but those associated with your new test module do not, so after execution, you will note that a test subdir exists, and contains a variety of classes. Furthermore, the classes already contain an entirely functional implementation (Conjecture has automatically generated appropriate method signatures and default method implementations).

  5. Now we compile the new code:
       prompt% make

    Note that typing make in the src/modules directory ended up performing a make in src. This is intentional. The makefiles in subdirectories of src usually just delegate those targets to the makefile in src, which is smart enough to compile every source file in every subdirectory, and will thus detect the new source code for the test module.

  6. Now we test your new module:
       prompt% cd ..          # directory $CONJECTURE/src
       prompt% make test.pnm  # creates a symlink to a test image for convenience
       prompt% conjecture -i test.pnm -o test.ocr -M test
       prompt% cat test.ocr

    Note that your module is producing the same output as the 'gocr' module. This shouldn't be surprising, since all of the components in your module are currently inheriting all functionality from their parent 'gocr' components.

  7. Now you can start modifying your module, by editing the component classes and adding new code.
       prompt% cd modules/test
       prompt% ls

    Edit the source file that defines the TestIdentify class and insert the following line as the first line in the TestIdentify::execute method.

        cerr << "NOTE: Here in TestIdentify::execute" << endl;

    Then compile and run again:

       # This command can be executed in any directory and will compile
       # the source code properly, because all Makefiles delegate to
       # $CONJECTURE/src
       prompt% make    
       # This command will only work in a directory that has a
       # test.pnm file (in our example so far, $CONJECTURE/src)
       prompt% conjecture -i test.pnm -o test.ocr -M test

    Note that running the program now prints out the line you added to your TestIdentify component, demonstrating that Conjecture really is executing your module.

  8. Now comes the hard part. You start adding code to some or all of the component classes in the modules/test directory. These changes either build on what gocr is doing, or entirely replace what it is doing. Of course, building on top of gocr requires knowledge on your part of the architecture and implementation details of gocr. And no matter what, you will need to understand the basics of the Element hierarchy provided by Conjecture. This hierarchy sub-divides an image into semantically meaningful sub-regions (pages, regions, lines, words and glyphs), and must be modified in order to satisfy the post-conditions of each component. This also implies that you need to understand the pre- and post-conditions of each component, and ensure that your component implementations satisfy the post-conditions upon completion (your component implementations can assume that the pre-conditions hold).
  9. The above was just an example. If you used the name test, please do not add it to the repository (this howto won't be as useful if you do), do not commit Conjecture.modules with the test record uncommented, and do not commit anything that assumes the existence of a TestModule class. On the other hand, once you have created a new, more appropriately named module, you are very much encouraged to commit it (along with the new Conjecture.modules and files).
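
As a rough picture of what steps 2 and 7 amount to in C++, consider the sketch below. The Element and GocrIdentify classes here are toy stand-ins written for this example (the real ones live in Conjecture and are far more involved), and the execute(Element* page) signature for the Identify component is assumed by analogy with the one described for segmentation. The generated code in modules/test will look different; the point is only the inherit-then-delegate shape.

```cpp
#include <cassert>
#include <iostream>
#include <string>

// Hypothetical stand-in for a page in Conjecture's Element hierarchy.
struct Element {
    std::string text;
};

// Hypothetical stand-in for the GOCR-backed Identify implementation.
class GocrIdentify {
public:
    virtual ~GocrIdentify() = default;
    virtual void execute(Element* page) {
        assert(page != nullptr);  // illustrative pre-condition
        page->text = "recognized by gocr";
    }
};

// "new COMPONENT Identify IMPLEMENTATION TestIdentify EXTENDS GocrIdentify"
// corresponds to a subclass that starts out by deferring to its parent.
class TestIdentify : public GocrIdentify {
public:
    void execute(Element* page) override {
        // The line added in step 7.
        std::cerr << "NOTE: Here in TestIdentify::execute" << std::endl;

        // Inherited behaviour: let the gocr implementation do the work.
        GocrIdentify::execute(page);

        // Your own code goes here, either refining the result produced
        // above or (if the parent call is removed) replacing it entirely.
    }
};
```

Because the parent call is kept, the module keeps producing gocr's output while you experiment, which is what makes the incremental approach in step 8 practical.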

The above process is useful if you want to improve the GOCR implementation. If you are more comfortable with the inner workings of some other third-party open-source OCR, you would change the record you added to Conjecture.modules appropriately. In order to start building on top of some other third-party OCR, however, Conjecture must first be extended to support it. If you are interested in extending an OCR that is not currently supported by Conjecture, please let us know on the developers mailing list.

OCR Modules Currently Available in Conjecture

The current class diagram for the OCRModule hierarchy is:

  • The DefaultModule class describes the default Conjecture implementation, and will represent the current "best" overall strategy.
  • The GocrModule class relies on the GOCR library, and provides facilities for having Conjecture data-structures like Page, Glyph, etc. interact with underlying GOCR data-structures like Job and box.
  • The OcradModule provides access to the Ocrad OCR, but currently via a sub-process execution (as opposed to linking the Ocrad source code into Conjecture). This will be changed as Conjecture evolves.
  • The WmhModule is an example of an experimental module. It represents Wade's first attempt at various parts of the OCR problem (currently focusing on identification). It relies on the GOCR module to provide segmentation, but provides its own identification and formatting algorithms.
