To explain both the overall Conjecture architecture and the rationale behind it, we will first look at some initial solutions to the problem of customization. The Conjecture framework is supposed to make it easy for individuals to contribute anything from tiny within-method improvements to single-method reimplementations to entirely new algorithms to entirely new fully-functional OCRs.
Suppose we limit ourselves to just the Element hierarchy for now, and consider how an individual can extend the framework. Since the Page class defines issue-solving methods like 'segment()', 'identify()', and 'format()', one natural means of customization is to introduce a subclass of Page and redefine the methods one wants to experiment with. For example, an individual could create a MyPage class and redefine 'segment()' in order to write a better glyph segmenting algorithm, or redefine 'identify()' in order to write a better glyph identification algorithm. This strategy is both straightforward and intuitive. However, it also suffers from a serious drawback. Specifically, there are many possible implementations of the 'segment()' algorithm, many possible implementations of the 'identify()' algorithm, and in general many different implementations of each of the issues related to OCR. Suppose there are K variants of 'segment()', L variants of 'identify()' and M variants of 'format()'. Using the strategy of subclassing Page and redefining the relevant methods, we would need to create K * L * M Page subclasses in order to represent all combinations of the various implementations. Depending on the values of K, L and M, this number can quickly become prohibitive, and that assumes there will only be 3 issue-solving methods defined on Page (which isn't true at all: there will probably be 20 or more such methods, each with many possible implementations).
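To put numbers on the K * L * M argument, here is a minimal sketch that multiplies the variant counts together. The function name 'page_subclasses_needed' is made up for this illustration and is not part of Conjecture.

```cpp
#include <cstdint>
#include <vector>

// Number of Page subclasses needed when every combination of method
// implementations has to be written as its own subclass.
// variants[i] is the number of alternative implementations of method i.
// (Hypothetical helper, for illustration only.)
std::uint64_t page_subclasses_needed(const std::vector<std::uint64_t>& variants) {
    std::uint64_t total = 1;
    for (std::uint64_t v : variants) {
        total *= v;  // each independent choice multiplies the combinations
    }
    return total;
}
```

With K = 4, L = 3 and M = 5 the function returns 60. With 20 methods that each have only 3 variants it returns 3,486,784,401, which is why the subclassing strategy does not scale.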
Having explained why subclassing the Element hierarchy to support extension can be problematic, we can now explain what the Conjecture framework does to address the problem. As is usually the case, adding a level of abstraction makes everything sunny and happy. Although this extra level of abstraction makes the overall architecture more complex, and thus somewhat more difficult to grasp at first, the flexibility it provides in allowing incremental improvements is well worth the increase in complexity.
As an example, Conjecture has formalized the concept of 'segmentation' (finding Glyphs within an Image) into a SegmentComponent class hierarchy. The abstract SegmentComponent class defines an interface, and subclasses provide implementations of that interface. Similarly, the concept of 'identification' (establishing Unicode values for each Glyph) has been formalized into the IdentifyComponent hierarchy, and so on. As new issues are identified, new component hierarchies will be added to the Conjecture framework.
So far, Conjecture has identified three fundamental issues: Segment, Identify and Format. However, Conjecture will be providing decompositions of each of these components into sub-components. For example, dust removal, page orientation, and font-detection are all possible sub-issues of segmentation that could be formalized into Component hierarchies.
The fact that components can be implemented in terms of other components highlights some important issues. First, it implies that there are constraints between components, and that the order in which components are performed may or may not matter, depending on the particular components in question. Furthermore, we have already demonstrated the decomposition of components by implicitly assuming that every OCR will perform segmentation, identification and formatting. Although this is probably true, in the interests of complete generality, Conjecture has also formalized the entire OCR process into a ProcessComponent. More on this later.
The Component-related class diagram of Conjecture currently looks like this:
                                    Component
                                        |
            .------------------.--------^---------.------------------.
            |                  |                  |                  |
    ProcessComponent   SegmentComponent   IdentifyComponent   FormatComponent
            |                  |                  |
      .-----^-----.           ...    .------------^--------------.-----...
      |           |                  |             |             |
StandardProcess OcradProcess   GocrIdentify   WmhIdentify DefaultIdentify
Notice that the four issues that Conjecture has currently formalized exist as subclass hierarchies of an abstract Component class. The ProcessComponent class defines an interface (collection of methods) that together are responsible for the entire optical character recognition process. Currently, the interface for all classes consists of a single 'execute(Page* page)' method, although more methods may be added to individual Component hierarchies as the framework evolves.
Subclasses of the Component hierarchies provide alternative implementations. For example, the StandardProcess subclass of ProcessComponent provides what will probably be the most common implementation of the Process component: perform segmentation, then identification, then formatting. This default implementation knows that the concepts of segmentation, identification and formatting have been formalized into Components of their own, and thus, when it wants to perform segmentation, it asks the Conjecture environment for an object that implements the SegmentComponent interface. Once it has obtained this object, it can invoke the 'execute(Page* page)' method on it (because it knows that this method is part of the SegmentComponent interface). The StandardProcess class needs to know absolutely nothing about the actual subclass of SegmentComponent being used in order to provide segmentation. This means that the StandardProcess class can work with any SegmentComponent (and any IdentifyComponent and FormatComponent), making the framework highly modular and flexible.
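The delegation described above can be sketched roughly as follows. This is an illustration under assumptions, not Conjecture's actual source: the Page struct, the Toy* classes and the constructor-based wiring are invented here, whereas the real StandardProcess obtains its components from the Conjecture environment. Only the class names from the diagram and the 'execute(Page* page)' method come from the text.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the framework's Page class; it only records
// which steps have been run on it.
struct Page {
    std::vector<std::string> log;
};

// Every component exposes the single execute(Page*) method.
class Component {
public:
    virtual ~Component() = default;
    virtual void execute(Page* page) = 0;
};

// One abstract class per formalized issue.
class ProcessComponent  : public Component {};
class SegmentComponent  : public Component {};
class IdentifyComponent : public Component {};
class FormatComponent   : public Component {};

// Toy implementations (hypothetical names, for illustration only).
class ToySegment : public SegmentComponent {
public:
    void execute(Page* page) override { page->log.push_back("segment"); }
};
class ToyIdentify : public IdentifyComponent {
public:
    void execute(Page* page) override { page->log.push_back("identify"); }
};
class ToyFormat : public FormatComponent {
public:
    void execute(Page* page) override { page->log.push_back("format"); }
};

// StandardProcess sees only the abstract interfaces. In this sketch they
// are passed in through the constructor (an assumption made to keep the
// example self-contained).
class StandardProcess : public ProcessComponent {
public:
    StandardProcess(SegmentComponent* s, IdentifyComponent* i, FormatComponent* f)
        : segment_(s), identify_(i), format_(f) {}

    void execute(Page* page) override {
        segment_->execute(page);   // find glyphs
        identify_->execute(page);  // assign a Unicode value to each glyph
        format_->execute(page);    // produce formatted output
    }

private:
    SegmentComponent*  segment_;
    IdentifyComponent* identify_;
    FormatComponent*   format_;
};
```

Swapping ToySegment for any other SegmentComponent subclass requires no change to StandardProcess, which is the modularity the paragraph above describes.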
The OcradProcess subclass of ProcessComponent does something different than StandardProcess. Instead of decomposing the problem into segmentation, identification and formatting, this class instead invokes the "ocrad" executable as a subprocess. Whether 'ocrad' decomposes the problem into segmentation, identification and formatting is irrelevant from the perspective of the OcradProcess class. All it knows is that it is responsible for taking an input image and producing output text. It knows that the 'ocrad' executable can do that, and thus, by invoking 'ocrad', it satisfies its mandate.
The OcradProcess is interesting because it demonstrates how Conjecture can be used to interact with arbitrary third-party OCRs (including even commercial ones, as long as they have a command-line interface). Note also that the current implementation of OcradProcess will be changing soon. Although delegation to an executable is sometimes useful, Conjecture is about identifying and extending existing algorithms for solving OCR problems, and as such it is much better to link the Ocrad source code into Conjecture so that programmers have access to the internals and can incrementally improve upon or borrow from its implementation in the creation of new algorithms.
We will discuss the Component hierarchy in more detail later. However, it will first be useful to step back and look at the bigger picture (pun fully intended :-) by discussing the OCR Module Hierarchy.
Conjecture provides support for comparing, sharing and interacting between an unlimited number of OCRs. This is accomplished by defining an OCR Module hierarchy. Each class in this hierarchy represents a fully-functional OCR capable of taking an image as input and producing text as output. Each such class must deal with input image processing (noise removal, orientation adjustment, etc.), segmentation into glyphs, identification of a Unicode value for each glyph, the production of properly formatted output, and everything else that goes into an OCR. However, these classes almost never do so directly. Instead, they almost always delegate responsibility for implementing individual issues to a specific subclass within the Component hierarchy associated with that issue.
In fact, classes in the OCR Module hierarchy are usually very simple, and are almost always automatically generated by the Conjecture infrastructure based on an entry in a special input file (see the discussion on Conjecture.modules).
The OCR Module hierarchy has an abstract class called, not surprisingly, OCRModule, as its root. This class defines a special interface consisting of a collection of Factory Methods (methods responsible for creating objects). One Factory Method exists for each Component that Conjecture has identified and formalized. The OCRModule class also defines one field for each component, and during creation uses the factory methods in order to assign instances to each of these fields.
Subclasses of the abstract OCRModule class represent concrete, fully-functional OCR implementations. They usually consist solely of definitions of the factory methods required by the OCRModule superclass. If the OCR in question does not need a particular Component, the associated factory method simply returns NULL. Each factory method specifies which Component subclass to create, and thus establishes which implementation of each Component is used by this particular OCR Module.
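A rough sketch of this Factory Method arrangement is shown below, cut down to two components for brevity. The method names ('create_process', 'create_segment', 'init'), the Toy* classes and the two example modules are assumptions made for illustration; the generated Conjecture classes may look quite different. A separate 'init()' step is used here because a C++ base-class constructor cannot dispatch virtual calls to a subclass.

```cpp
#include <memory>

struct Page {};  // hypothetical stand-in for the framework's Page class

class Component {
public:
    virtual ~Component() = default;
    virtual void execute(Page* page) = 0;
};
class ProcessComponent : public Component {};
class SegmentComponent : public Component {};

// Toy component implementations (hypothetical).
class ToyProcess : public ProcessComponent {
public:
    void execute(Page*) override {}
};
class ToySegment : public SegmentComponent {
public:
    void execute(Page*) override {}
};

class OCRModule {
public:
    virtual ~OCRModule() = default;

    // Fills one field per component by calling the factory methods.
    void init() {
        process_.reset(create_process());
        segment_.reset(create_segment());
    }

    ProcessComponent* process() const { return process_.get(); }
    SegmentComponent* segment() const { return segment_.get(); }

protected:
    // Factory methods: one per formalized component.
    virtual ProcessComponent* create_process() = 0;
    virtual SegmentComponent* create_segment() = 0;

private:
    std::unique_ptr<ProcessComponent> process_;
    std::unique_ptr<SegmentComponent> segment_;
};

// A module that uses both components.
class ToyModule : public OCRModule {
protected:
    ProcessComponent* create_process() override { return new ToyProcess; }
    SegmentComponent* create_segment() override { return new ToySegment; }
};

// A module that does not need a Segment component: its factory method
// simply returns a null pointer.
class ProcessOnlyModule : public OCRModule {
protected:
    ProcessComponent* create_process() override { return new ToyProcess; }
    SegmentComponent* create_segment() override { return nullptr; }
};
```

Callers of a module built this way have to check 'segment()' for null before using it, since a module is allowed to omit a component.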
The Conjecture framework (specifically, the Env class) maintains a special 'ocr' field that is initialized to an instance of a subclass of OCRModule based on the -A command-line flag. This field effectively acts as a container for a collection of Component subclass instances. Anywhere in the framework, code that needs to perform segmentation can ask for the 'ocr' field of Env and, from this field, invoke the public 'segment()' method, which returns an instance of a subclass of SegmentComponent. The 'execute(Page* page)' method can then be invoked on this object in order to perform segmentation, all without having to know anything whatsoever about the actual implementation details. This design allows us to completely separate individual components from one another, making for a flexible plug-and-play paradigm.
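As a sketch of how such a lookup might work, the following toy Env selects a module by name and exposes it through an 'ocr()' accessor. The registration mechanism, the 'select' method and the Toy* classes are all assumptions for illustration; the text only states that Env holds an 'ocr' field chosen via the -A flag and that 'segment()' returns a SegmentComponent.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

struct Page { bool segmented = false; };  // hypothetical stand-in

class SegmentComponent {
public:
    virtual ~SegmentComponent() = default;
    virtual void execute(Page* page) = 0;
};
class ToySegment : public SegmentComponent {
public:
    void execute(Page* page) override { page->segmented = true; }
};

class OCRModule {
public:
    virtual ~OCRModule() = default;
    virtual SegmentComponent* segment() = 0;
};
class ToyModule : public OCRModule {
public:
    SegmentComponent* segment() override { return &segment_; }
private:
    ToySegment segment_;
};

// Hypothetical Env: holds the module chosen by name (the value given to -A).
class Env {
public:
    using Maker = std::function<std::unique_ptr<OCRModule>()>;

    void register_module(const std::string& name, Maker maker) {
        makers_[name] = std::move(maker);
    }

    // Creates the named module and stores it in the 'ocr' field.
    // Throws std::invalid_argument if the name is unknown.
    void select(const std::string& name) {
        auto it = makers_.find(name);
        if (it == makers_.end()) {
            throw std::invalid_argument("unknown OCR module: " + name);
        }
        ocr_ = it->second();
    }

    OCRModule* ocr() const { return ocr_.get(); }

private:
    std::map<std::string, Maker> makers_;
    std::unique_ptr<OCRModule> ocr_;
};
```

Client code then reads 'env.ocr()->segment()->execute(&page)' and never names a concrete SegmentComponent subclass.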
The current class diagram for the OCRModule hierarchy is:
                       OCRModule
                           |
      .-------------.------^------.-------------.-----...
      |             |             |             |
DefaultModule  GocrModule    OcradModule    WmhModule   ...
The config/Conjecture.modules file provides a description of every OCR Module currently supported by Conjecture. To add a new OCR Module, a user simply edits this file and adds another record in the same format as those already present. This file is expected to grow as more and more Modules are implemented, so we do not present a comprehensive summary here. Instead, we'll provide a small sample file and discuss its syntax:
MODULE default CLASS DefaultModule
    ALGORITHM Process STRATEGY StandardProcess;
    extern ALGORITHM Segment STRATEGY GocrSegment;
    extern ALGORITHM Identify STRATEGY GocrIdentify;
    extern ALGORITHM Format STRATEGY GocrFormat;

MODULE gocr CLASS GocrModule
    extern ALGORITHM Process STRATEGY StandardProcess;
    ALGORITHM Segment STRATEGY GocrSegment;
    ALGORITHM Identify STRATEGY GocrIdentify;
    ALGORITHM Format STRATEGY GocrFormat;

MODULE ocrad CLASS OcradModule
    ALGORITHM Process STRATEGY OcradProcess;
    no ALGORITHM Segment;
    no ALGORITHM Identify;
    no ALGORITHM Format;

MODULE wmh CLASS WmhModule
    extern ALGORITHM Process STRATEGY StandardProcess;
    ALGORITHM Segment STRATEGY WmhSegment;
    ALGORITHM Identify STRATEGY WmhIdentify;
    ALGORITHM Format STRATEGY WmhFormat;
There are four Modules defined in this sample file, with the names 'default', 'gocr', 'ocrad' and 'wmh'. Note that the public name appears immediately after the 'MODULE' keyword, and is not necessarily the same as the class name (which appears after the 'CLASS' keyword). When invoking conjecture at the command line, the -A flag accepts a module name as a value.
Each OCR Module is fully described by specifying which class it will use to implement each Component. For example, GocrModule uses the StandardProcess subclass of ProcessComponent to implement the 'process' issue, and uses Gocr-specific subclasses of SegmentComponent, IdentifyComponent and FormatComponent to deal with segmentation, identification and formatting respectively.
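To make the record syntax concrete, here is a small reader for it. It is a sketch based only on the sample file: it accepts 'MODULE <name> CLASS <class>' followed by entries of the form '[extern] ALGORITHM <issue> STRATEGY <class>;' or 'no ALGORITHM <issue>;'. It is not the parser Conjecture itself uses, the struct and function names are invented, and it records the 'extern' prefix without interpreting what it means.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

struct AlgorithmEntry {
    std::string issue;       // e.g. "Segment"
    std::string strategy;    // e.g. "GocrSegment"; empty for a "no" entry
    bool is_extern = false;  // entry was prefixed with "extern"
};

struct ModuleEntry {
    std::string name;        // public name, follows MODULE
    std::string class_name;  // follows CLASS
    std::vector<AlgorithmEntry> algorithms;
};

// Removes one trailing ';' if present.
static std::string strip_semicolon(std::string word) {
    if (!word.empty() && word.back() == ';') word.pop_back();
    return word;
}

// Parses whitespace-separated tokens; line breaks are not significant.
std::vector<ModuleEntry> parse_modules(const std::string& text) {
    std::istringstream in(text);
    std::vector<ModuleEntry> modules;
    std::string word;
    bool is_extern = false;  // pending "extern" prefix
    bool is_absent = false;  // pending "no" prefix
    while (in >> word) {
        if (word == "MODULE") {
            ModuleEntry m;
            std::string keyword;
            if (!(in >> m.name >> keyword >> m.class_name) || keyword != "CLASS")
                throw std::runtime_error("expected: MODULE <name> CLASS <class>");
            modules.push_back(m);
        } else if (word == "extern") {
            is_extern = true;
        } else if (word == "no") {
            is_absent = true;
        } else if (word == "ALGORITHM") {
            if (modules.empty())
                throw std::runtime_error("ALGORITHM before any MODULE");
            AlgorithmEntry a;
            a.is_extern = is_extern;
            if (!(in >> a.issue))
                throw std::runtime_error("expected an issue name after ALGORITHM");
            if (is_absent) {
                a.issue = strip_semicolon(a.issue);
            } else {
                std::string keyword;
                if (!(in >> keyword >> a.strategy) || keyword != "STRATEGY")
                    throw std::runtime_error("expected: STRATEGY <class>;");
                a.strategy = strip_semicolon(a.strategy);
            }
            modules.back().algorithms.push_back(a);
            is_extern = false;
            is_absent = false;
        } else {
            throw std::runtime_error("unexpected token: " + word);
        }
    }
    return modules;
}
```

Feeding the 'ocrad' record through 'parse_modules' yields one ModuleEntry whose Segment, Identify and Format entries have an empty strategy, mirroring the 'no ALGORITHM' lines.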
The ocrgen script reads the Conjecture.modules file and verifies the existence of an appropriate subdirectory of src/custom, and of source/header files for the Module and Component classes provided by the Module.
... more here ...