An Overview of Modules
This documentation discusses the fundamental concept of an
OCR Module, abbreviated simply as Module. It also
discusses how Conjecture combines a collection of Components into a Module, so an
understanding of components is useful when reading this page
(unfortunately, an understanding of modules is also useful when
reading about components :-).
Conjecture provides support for analyzing, assessing,
comparing, and modifying an unlimited number of OCRs. This is
accomplished by defining an OCR Module hierarchy. Each
class in this hierarchy represents a fully-functional OCR capable
of taking an image as input and producing text as output. Each
class in this hierarchy must deal with input image processing
(noise removal, orientation adjustment, etc.), segmentation into
glyphs (and words and lines and columns), identification of
unicodes for each glyph, the production of properly formatted
output, and everything else that goes into an OCR. However, these
classes almost never do all of this work directly. Instead, they
usually delegate responsibility for implementing individual issues
to an Implementation of the Component
representing that issue.
To clarify this more, suppose we want to create a new module.
The name of this module is important because it usually
establishes both the name of the class that will implement it and
the value to pass to the command-line
-A flag to indicate that this OCR module should be
used to perform image-to-text translation. For our example module,
we will choose the name 'test' (the value to provide to the
-A flag), and use the class TestModule .
Now, it is possible to implement the entire image-to-text
functionality using the TestModule class itself
(along with any number of helper classes, of course). However,
implementing all of the code for the OCR in this fashion makes
it difficult to experiment with alternative strategies for the
various sub-components making up an OCR. The implementation
becomes rigid and inflexible.
A much more flexible approach is to identify the various
subcomponents, formalize them into Component class
hierarchies (in which the root class provides an interface
and subclasses represent alternative implementations), and have
TestModule delegate responsibility for performing the
various actions necessary to appropriate component implementation
instances. This is the strategy chosen for the Conjecture
framework.
Classes in the OCR Module hierarchy are usually very simple. So
simple, in fact, that they can be automatically generated by the
Conjecture infrastructure based on an entry in a special input
file (see the discussion on Conjecture.modules).
The OCR Module hierarchy has an abstract class called, not
surprisingly, OCRModule , as its root. This class
defines a special interface consisting of a collection of Factory
Methods (methods responsible for creating objects). One Factory
Method exists for each Component that Conjecture has identified and
formalized. The OCRModule class also defines one
field for each component. During creation, OCRModule
uses the factory methods (redefined in subclasses to provide
subclass-specific component implementation instances) to obtain
instances to initialize these fields with.
Subclasses of the abstract OCRModule class
represent concrete, fully-functional OCR implementations. They
usually consist solely of definitions of the factory methods
required by the OCRModule superclass. If the OCR in
question does not need a particular Component, the associated
factory method simply returns NULL . Each factory
method specifies which Component subclass to create, and thus
establishes which implementation of each Component is used by this
particular OCR Module.
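The factory-method arrangement described above can be sketched roughly as follows. This is a minimal illustration, not Conjecture's actual code: the component interfaces are reduced to a single name() method, the create...() and init() names are hypothetical, and only two components are shown. One assumption is worth calling out: in C++, virtual calls made from a base-class constructor do not reach subclass overrides, so this sketch fills the fields in a separate init() step. How the real OCRModule handles this is not stated here.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical, heavily reduced component interfaces.
struct SegmentComponent {
    virtual ~SegmentComponent() = default;
    virtual std::string name() const = 0;
};
struct IdentifyComponent {
    virtual ~IdentifyComponent() = default;
    virtual std::string name() const = 0;
};

// Stand-ins for two concrete component implementations.
struct GocrSegment : SegmentComponent {
    std::string name() const override { return "GocrSegment"; }
};
struct GocrIdentify : IdentifyComponent {
    std::string name() const override { return "GocrIdentify"; }
};

class OCRModule {
public:
    virtual ~OCRModule() = default;

    // Fills one field per component by calling the factory methods.
    // Kept out of the constructor (an assumption of this sketch) because
    // virtual calls from a base constructor would not reach the overrides.
    void init() {
        segment_.reset(createSegment());
        identify_.reset(createIdentify());
    }

    SegmentComponent* segment() const { return segment_.get(); }
    IdentifyComponent* identify() const { return identify_.get(); }

protected:
    // Factory methods: return a new instance, or a null pointer if the
    // module does not need that component.
    virtual SegmentComponent* createSegment() = 0;
    virtual IdentifyComponent* createIdentify() = 0;

private:
    std::unique_ptr<SegmentComponent> segment_;
    std::unique_ptr<IdentifyComponent> identify_;
};

// A concrete module consists solely of factory-method definitions.
class GocrModule : public OCRModule {
protected:
    SegmentComponent* createSegment() override { return new GocrSegment; }
    IdentifyComponent* createIdentify() override { return new GocrIdentify; }
};

// Hypothetical module that has no use for the Identify component.
class SegmentOnlyModule : public OCRModule {
protected:
    SegmentComponent* createSegment() override { return new GocrSegment; }
    IdentifyComponent* createIdentify() override { return nullptr; }
};

void demoFactoryMethods() {
    GocrModule gocr;
    gocr.init();
    assert(gocr.segment()->name() == "GocrSegment");
    assert(gocr.identify()->name() == "GocrIdentify");

    SegmentOnlyModule partial;
    partial.init();
    assert(partial.segment() != nullptr);
    assert(partial.identify() == nullptr);
}
```

Swapping in a different implementation of a component then means changing one factory method in one subclass, which is the flexibility the design is after.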
The Conjecture framework (specifically, the Env
class) maintains a special module field that is
initialized to an instance of a subclass of OCRModule
based on the -A command-line flag. This field
effectively acts as a container for a collection of Component
subclass instances. Anywhere in the framework, if one wants to
perform segmentation, they can ask for the
module field of the (usually unique) Env
instance. The resulting object has methods for obtaining the
component implementations associated with every component,
including a segment() method that returns an instance
of a subclass of SegmentComponent . The
execute(Element*page) method can then be invoked on
this object in order to perform segmentation, all without having
to know anything about the actual implementation
details. This design allows us to completely separate individual
components from one another, making for a flexible plug-and-play
paradigm.
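As a rough sketch of that access path, the code below goes from an Env to its module, then to the segment component, then calls execute on a page. Everything here is a simplified stand-in: the Element struct, the FakeSegment class, the module() accessor (the text only says Env has a module field) and the constructors are assumptions made for illustration, and OCRModule is shown as a plain holder rather than the abstract class it really is.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for Conjecture's Element hierarchy.
struct Element {
    std::string label;
    int segmentCount = 0;
};

struct SegmentComponent {
    virtual ~SegmentComponent() = default;
    virtual void execute(Element* page) = 0;
};

// Hypothetical implementation that just records that it ran.
struct FakeSegment : SegmentComponent {
    void execute(Element* page) override { page->segmentCount += 1; }
};

// Simplified module: holds a single component instance.
class OCRModule {
public:
    explicit OCRModule(SegmentComponent* seg) : segment_(seg) {}
    SegmentComponent* segment() const { return segment_.get(); }
private:
    std::unique_ptr<SegmentComponent> segment_;
};

// Simplified Env: owns the module selected at start-up.
class Env {
public:
    explicit Env(OCRModule* m) : module_(m) {}
    OCRModule* module() const { return module_.get(); }
private:
    std::unique_ptr<OCRModule> module_;
};

// Caller code sees only the abstract interfaces, never FakeSegment.
void segmentPage(const Env& env, Element* page) {
    SegmentComponent* seg = env.module()->segment();
    if (seg != nullptr) {
        seg->execute(page);
    }
}

void demoEnvAccess() {
    Env env(new OCRModule(new FakeSegment));
    Element page;
    page.label = "page-1";
    segmentPage(env, &page);
    assert(page.segmentCount == 1);
}
```

Note that segmentPage checks for a null component, since a module is allowed to leave a component out.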
The config/Conjecture.modules File
The config/Conjecture.modules file provides a
description of every OCR Module currently supported by
Conjecture. To add a new OCR Module, a user simply edits this
file and adds another record in the same format as those
already present. This file is expected to grow as more and
more Modules are implemented, so we do not present a
comprehensive summary here. Instead, we'll provide a small
sample file and discuss the syntax for it:
MODULE default CLASS DefaultModule
COMPONENT Process IMPLEMENTATION StandardProcess;
extern COMPONENT Segment IMPLEMENTATION GocrSegment;
extern COMPONENT Identify IMPLEMENTATION GocrIdentify;
extern COMPONENT Format IMPLEMENTATION GocrFormat;
MODULE gocr CLASS GocrModule
extern COMPONENT Process IMPLEMENTATION StandardProcess;
COMPONENT Segment IMPLEMENTATION GocrSegment;
COMPONENT Identify IMPLEMENTATION GocrIdentify;
COMPONENT Format IMPLEMENTATION GocrFormat;
MODULE ocrad CLASS OcradModule
COMPONENT Process IMPLEMENTATION OcradProcess;
no COMPONENT Segment;
no COMPONENT Identify;
no COMPONENT Format;
MODULE wmh CLASS WmhModule
extern COMPONENT Process IMPLEMENTATION StandardProcess;
COMPONENT Segment IMPLEMENTATION WmhSegment;
COMPONENT Identify IMPLEMENTATION WmhIdentify;
COMPONENT Format IMPLEMENTATION WmhFormat;
There are four Modules defined in this sample file, with the
names 'default', 'gocr', 'ocrad' and 'wmh'. Note that the public
name appears immediately after the 'MODULE' keyword, and is not
necessarily the same as the class name (which appears after the
'CLASS' keyword). When invoking conjecture at the
command line, the -A flag accepts a module name as a
value.
Each OCR Module is fully described by specifying which class
it will use to implement each Component. For example, GocrModule
uses the StandardProcess subclass of ProcessComponent to implement
the 'process' issue, and uses Gocr-specific subclasses of
SegmentComponent, IdentifyComponent and FormatComponent to deal
with segmentation, identification and formatting respectively.
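To make the record shape concrete, here is a small sketch that reads records in the format of the sample file into plain structs. It is inferred purely from the sample shown above and is not the parser that Conjecture itself uses. The struct and function names are hypothetical, the qualifier words (extern, new, no) are stored as written without interpreting them, and any EXTENDS clause is simply skipped.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

struct ComponentEntry {
    std::string qualifier;       // "", "extern", "new" or "no", as written
    std::string component;       // e.g. "Segment"
    std::string implementation;  // empty for "no COMPONENT ...;"
};

struct ModuleRecord {
    std::string name;       // public name, follows MODULE
    std::string className;  // follows CLASS
    std::vector<ComponentEntry> components;
};

// Removes one trailing ';' if present.
static void stripSemicolon(std::string& s) {
    if (!s.empty() && s.back() == ';') s.pop_back();
}

// Whitespace-token reader for the sample format. Tokens it does not
// recognize (including EXTENDS clauses) are ignored.
std::vector<ModuleRecord> parseModules(const std::string& text) {
    std::vector<ModuleRecord> records;
    std::istringstream in(text);
    std::string tok, qualifier;
    while (in >> tok) {
        if (tok == "MODULE") {
            ModuleRecord rec;
            std::string kw;  // expected to be CLASS
            in >> rec.name >> kw >> rec.className;
            records.push_back(rec);
        } else if (tok == "extern" || tok == "new" || tok == "no") {
            qualifier = tok;
        } else if (tok == "COMPONENT" && !records.empty()) {
            ComponentEntry e;
            e.qualifier = qualifier;
            qualifier.clear();
            in >> e.component;
            if (!e.component.empty() && e.component.back() == ';') {
                stripSemicolon(e.component);  // "no COMPONENT Segment;"
            } else {
                std::string kw;  // expected to be IMPLEMENTATION
                in >> kw >> e.implementation;
                stripSemicolon(e.implementation);
            }
            records.back().components.push_back(e);
        }
    }
    return records;
}

void demoParseModules() {
    const std::string sample =
        "MODULE gocr CLASS GocrModule\n"
        "extern COMPONENT Process IMPLEMENTATION StandardProcess;\n"
        "COMPONENT Segment IMPLEMENTATION GocrSegment;\n"
        "MODULE ocrad CLASS OcradModule\n"
        "COMPONENT Process IMPLEMENTATION OcradProcess;\n"
        "no COMPONENT Segment;\n";
    std::vector<ModuleRecord> recs = parseModules(sample);
    assert(recs.size() == 2);
    assert(recs[0].name == "gocr" && recs[0].className == "GocrModule");
    assert(recs[0].components[0].qualifier == "extern");
    assert(recs[0].components[0].implementation == "StandardProcess");
    assert(recs[1].components[1].qualifier == "no");
    assert(recs[1].components[1].implementation.empty());
}
```

The point of the sketch is only that a record boils down to a public name, a class name, and one entry per component.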
The bin/ocrgen Script
The ocrgen script reads the
Conjecture.modules file and verifies the existence of an
appropriate subdirectory of src/modules , and source/header
files for the Module and Component classes provided by the Module.
... more here ...
Creating a new OCR Module
NOTE: This is a very brief description of a really important
topic that needs to be much more fully addressed. Please contribute
documentation if you have ideas for how to explain the process better.
Before making a module of your own, it is useful to understand
how to run existing modules on images, so you should read the
Usage documentation before proceeding with
this discussion. This discussion assumes that Conjecture is
installed and is working correctly. To determine this, try:
prompt% cd $CONJECTURE
prompt% make verify
If the above does not produce a table consisting of a whole
bunch of plus signs ('+'), something is wrong. Inform us on the developers
mailing list.
Here is the way you should create your first OCR Module. It
differs somewhat from how you would create subsequent modules (once
you become more familiar with the Conjecture architecture), but
provides a quick means of getting something up-and-running
immediately.
- Establish a name for your module. We'll use the name
test , but for your module, I'd suggest your initials.
For example, my first OCR module is wmh .
- Edit
$CONJECTURE/config/Conjecture.modules
(see the discussion) by adding
a new record that looks like this (or uncommenting the
record already present in the file for exactly this purpose):
MODULE test CLASS TestModule EXTENDS GocrModule
extern COMPONENT Process IMPLEMENTATION StandardProcess;
new COMPONENT Segment IMPLEMENTATION TestSegment EXTENDS GocrSegment;
new COMPONENT Identify IMPLEMENTATION TestIdentify EXTENDS GocrIdentify;
new COMPONENT Format IMPLEMENTATION TestFormat EXTENDS GocrFormat;
The above is informing Conjecture that you want an OCR module
named test , with module class TestModule .
Furthermore, it says that you want to use the (pre-existing)
StandardProcess implementation for the
Process component (which means you do not need to
implement it at all). It also says that this new module will be
providing its own implementations for the Segment ,
Identify and Format components. However,
it also indicates that each of these new implementations will be
based on the corresponding GOCR implementation (that is, that the
classes representing the implementations will inherit from the gocr
implementation classes). This last fact is important, because it
allows us to create a new Module that will be able to perform
image-to-text identification immediately, after which you can start
incrementally replacing or augmenting gocr functionality as you
desire.
Note that currently Conjecture has the most support for
interacting with gocr, but support for other third-party OCRs will
be added when we (including you!) do it. If you are remotely
familiar with the architecture of any existing open-source OCR,
please let us know on the developers
mailing list.
- Once you've edited the
Conjecture.modules file,
do the following:
prompt% cd $CONJECTURE/src/modules
prompt% ls
The src/modules directory is where all the
module-specific code (primarily, implementations of components)
resides. There is a subdirectory for each OCR module. Note that your
new module, test , is not yet there. The
abstract subdir is special (it isn't a module, but
instead contains the abstract superclasses of the Module and
Component hierarchies). The SomeName files are also
special - they are templates for any new classes you might want to
create.
- Now do the following:
prompt% make modules
prompt% ls
prompt% ls test
This causes Conjecture to invoke the special ocrgen
script, which is responsible for parsing the
$CONJECTURE/config/Conjecture.modules file (the one you
just edited) and ensuring that C++ classes exist corresponding to
the information present in that file. Most of the classes specified
by the file already exist, but those associated with your new
test module do not, so after execution, you will note
that a test subdir exists, and contains a variety of
classes. Furthermore, the classes already contain an entirely
functional implementation (Conjecture has automatically generated
appropriate method signatures and default method implementations).
- Now we compile the new code:
prompt% make
Note that typing make in the src/modules
directory ended up performing a make in
src . This is intentional. The makefiles in
subdirectories of src usually just delegate those
targets to the makefile in src , which is smart enough
to compile every source file in every subdirectory, and will thus
detect the new source code for the test module.
- Now we test your new module:
prompt% cd .. # directory $CONJECTURE/src
prompt% make test.pnm # creates a symlink to a test image for convenience
prompt% conjecture -i test.pnm -o test.ocr -M test
prompt% cat test.ocr
Note that your module is producing the same output as the 'gocr'
module. This shouldn't be surprising, since all of the components in
your module are currently inheriting all functionality from their
parent 'gocr' components.
- Now you can start modifying your module, by editing the
component classes and adding new code.
prompt% cd modules/test
prompt% ls
Edit the TestIdentify.cc file and insert the
following line as the first line in the
TestIdentify::execute method.
cerr << "NOTE: Here in TestIdentify::execute" << endl;
Then compile and run again:
# This command can be executed in any directory and will compile
# the source code properly, because all Makefiles delegate to
# $CONJECTURE/src
prompt% make
# This command will only work in a directory that has a
# test.pnm file (in our example so far, $CONJECTURE/src)
prompt% conjecture -i test.pnm -o test.ocr -M test
Note that running the program now prints out the line you added
to your TestIdentify component, demonstrating that Conjecture really
is executing your module.
- Now comes the hard part. You start adding code to some or all
of the component classes in the
modules/test directory.
These changes either build on what gocr is doing, or entirely
replace what it is doing. Of course, building on top of gocr
requires knowledge on your part of the architecture and
implementation details of gocr. And no matter what, you will need to
understand the basics of the Element hierarchy, provided by
Conjecture. This hierarchy sub-divides an image into semantically
meaningful sub-regions (pages, regions, lines, words and glyphs),
and must be modified in order to satisfy the post-conditions of
each component. This also implies that you need to understand the
pre- and post-conditions of each component, and ensure that your
component implementations satisfy the post-conditions upon
completion (your component implementations can assume that the
pre-conditions hold).
- The above was just an example. If you used the name
test , please do not add it into the repository (this
howto won't be as useful if you do), do not commit
Components.modules with the test record uncommented,
and do not commit Modules.cc if it assumes the
existence of a TestModule class. On the other hand, once you have
created a new, more appropriately named, module, you are very much
encouraged to commit it (along with the new
Components.modules and Modules.cc files).
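The pre- and post-condition discipline mentioned in the "hard part" step can be illustrated with a toy example. The actual conditions for each Conjecture component are not listed on this page, so the ones below are invented for illustration: a segmentation step that expects an unsegmented page and promises at least one glyph afterwards. The Element struct, the Kind enum and both functions are hypothetical stand-ins for the real Element hierarchy.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// The levels named in the text: pages, regions, lines, words and glyphs.
enum class Kind { Page, Region, Line, Word, Glyph };

// Hypothetical, minimal element tree.
struct Element {
    Kind kind;
    std::vector<std::unique_ptr<Element>> children;

    explicit Element(Kind k) : kind(k) {}

    // Appends a child of the given kind and returns a pointer to it.
    Element* add(Kind k) {
        children.push_back(std::unique_ptr<Element>(new Element(k)));
        return children.back().get();
    }
};

// Counts elements of the given kind in the subtree rooted at e.
int countKind(const Element& e, Kind k) {
    int n = (e.kind == k) ? 1 : 0;
    for (const auto& child : e.children) {
        n += countKind(*child, k);
    }
    return n;
}

// Toy segmentation step with explicit contract checks.
void segmentExecute(Element* page) {
    // Assumed pre-condition: we are handed an unsegmented page.
    assert(page != nullptr);
    assert(page->kind == Kind::Page && page->children.empty());

    // Placeholder "segmentation": one region, one line, one word, two glyphs.
    Element* word = page->add(Kind::Region)->add(Kind::Line)->add(Kind::Word);
    word->add(Kind::Glyph);
    word->add(Kind::Glyph);

    // Assumed post-condition: the page now contains at least one glyph,
    // so a later identification step has something to work on.
    assert(countKind(*page, Kind::Glyph) > 0);
}
```

Writing the conditions down as assertions, even informally like this, makes it easier to tell whether a replacement component can be dropped in without breaking the components that run after it.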
The above process is useful if you want to improve the GOCR
implementation. However, if you are more comfortable with the inner
workings of some other third-party open-source OCR, you would change
the record you added to Conjecture.modules
appropriately. However, in order to start building on top of some
other third-party OCR, Conjecture must first be extended to support
it. If you are interested in extending an OCR that is not currently
supported by Conjecture, please let us know on the developers
mailing list.
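The EXTENDS arrangement used in the walkthrough can be sketched as plain C++ inheritance. The code below is an assumption about what a generated TestIdentify might look like after the one-line edit from the walkthrough; the real generated sources may differ. The Element struct and the body of GocrIdentify::execute are stand-ins, and the execute(Element*) signature is borrowed from the segmentation example given earlier on this page.

```cpp
#include <cassert>
#include <iostream>
#include <string>

// Hypothetical stand-in for Conjecture's Element hierarchy.
struct Element {
    std::string text;
};

class IdentifyComponent {
public:
    virtual ~IdentifyComponent() = default;
    virtual void execute(Element* page) = 0;
};

// Stand-in for the gocr-backed implementation.
class GocrIdentify : public IdentifyComponent {
public:
    void execute(Element* page) override {
        page->text = "recognized by gocr";
    }
};

// New implementation that inherits from the gocr one. The default
// behavior (assumed here) is to forward to the parent class.
class TestIdentify : public GocrIdentify {
public:
    void execute(Element* page) override {
        // The line added in the walkthrough.
        std::cerr << "NOTE: Here in TestIdentify::execute" << std::endl;
        GocrIdentify::execute(page);
    }
};

void demoTestIdentify() {
    Element viaGocr, viaTest;
    GocrIdentify gocr;
    TestIdentify test;
    gocr.execute(&viaGocr);
    test.execute(&viaTest);
    // Until TestIdentify changes more than the log line, both agree.
    assert(viaTest.text == viaGocr.text);
}
```

From this starting point, replacing gocr functionality means dropping the call to the parent's execute, while augmenting it means adding code before or after that call.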
OCR Modules Currently Available in Conjecture
The current class diagram for the OCRModule hierarchy is:
- The
DefaultModule class describes the default
Conjecture implementation, and will represent the current "best"
overall strategy.
- The
GocrModule class relies on the GOCR library,
and provides facilities for having Conjecture data-structures like
Page, Glyph, etc. interact with underlying GOCR data-structures like
Job and box.
- The
OcradModule provides access to the Ocrad OCR,
but currently via a sub-process execution (as opposed to linking the
Ocrad source code into Conjecture). This will be changed as
Conjecture evolves.
- The
WmhModule is an example of an experimental
module. It represents Wade's first attempt at various parts of
the OCR problem (currently focusing on identification). It
relies on the GOCR module to provide segmentation, but provides
its own identification and formatting algorithms.