An Extensible Optical Character Recognition Framework

Conjecture Terminology

component
A formalization of an issue that must be addressed in order to provide image-to-text conversion. Each component has a canonical name and generates an entire class hierarchy within the Conjecture framework. For example, one issue of importance is 'glyph identification', which has been formalized into the IdentifyComponent component. The IdentifyComponent class provides an interface that "resolves" or "addresses" the issue. Subclasses of IdentifyComponent provide alternative implementations of this interface. A particular Module will use a specific subclass of IdentifyComponent to provide glyph identification.
component implementation
For a component <Name>, an implementation is a subclass of the <Name>Component interface.
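The relationship between a component, its implementations, and a Module can be sketched as follows. This is an illustrative Python sketch only, not Conjecture's actual API: the IdentifyComponent name comes from the definition above, while the identify() signature, the TemplateIdentify subclass, and the shape of Module are assumptions made for the example.

```python
from abc import ABC, abstractmethod


class IdentifyComponent(ABC):
    """Interface for the 'glyph identification' issue (name taken from the text)."""

    @abstractmethod
    def identify(self, glyph_pixels):
        """Return a Unicode code point for a glyph bitmap (hypothetical signature)."""


class TemplateIdentify(IdentifyComponent):
    """Hypothetical implementation: exact match against stored template bitmaps."""

    def __init__(self, templates):
        # templates maps a code point to a tuple-of-tuples bitmap
        self._templates = dict(templates)

    def identify(self, glyph_pixels):
        for code_point, bitmap in self._templates.items():
            if bitmap == glyph_pixels:
                return code_point
        return 0xFFFD  # U+FFFD REPLACEMENT CHARACTER for unrecognized glyphs


class Module:
    """Hypothetical stand-in for a Module that selects one implementation."""

    def __init__(self, identifier: IdentifyComponent):
        self.identifier = identifier

    def recognize(self, glyphs):
        # Identify each glyph in order and join the results into a string.
        return "".join(chr(self.identifier.identify(g)) for g in glyphs)
```

Swapping TemplateIdentify for another subclass changes the identification behavior without touching Module, which is the point of formalizing the issue as a component.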
glyph segmentation
The identification of the regions within a Page representing individual Glyphs. Glyph segmentation is not the same as Glyph identification; it necessarily precedes it.
glyph identification
The process of converting a graphic image representing a character (a Glyph) into a Unicode value. It is at the core of any OCR program. Its efficacy depends not only on its internal strategies, but also on the accuracy of Glyph segmentation. If a region is identified as a Glyph but is in fact only part of a glyph, or spans multiple glyphs, the accuracy of identification will inevitably decline.
horizontal
The direction that aligns glyphs within a line. Normally, this is just the x direction, but if the image is rotated, then 'horizontal' refers to the x direction after this rotation has been compensated for.
issue
Some concept or problem of importance during optical character recognition. A conceptual term. Examples of issues include segmentation, identification, formatting, dust removal, and line angle detection.
line segmentation
The identification of the regions within a Page representing individual Lines of horizontally adjacent Glyphs. Line segmentation may occur before or after Glyph segmentation, depending on the overall segmentation strategy employed.
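One simple "lines first, then glyphs" strategy uses projection profiles: rows containing ink are grouped into lines, and within each line, columns containing ink are grouped into glyph regions. The sketch below is a minimal Python illustration under stated assumptions, not Conjecture's own method: the page is a list of rows of 0/1 pixels, it has already been deskewed so that lines run along x, and the function names are hypothetical.

```python
def _runs(flags):
    """Return (start, end) index pairs, end exclusive, for each run of True values."""
    runs, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i                      # a run of inked rows/columns begins
        elif not flag and start is not None:
            runs.append((start, i))        # the run ended just before index i
            start = None
    if start is not None:
        runs.append((start, len(flags)))   # run extends to the last index
    return runs


def segment_lines(page):
    """Group consecutive rows containing any ink into (top, bottom) line regions."""
    return _runs([any(row) for row in page])


def segment_glyphs(page, top, bottom):
    """Within rows top..bottom-1, group inked columns into (left, right) regions."""
    width = len(page[0]) if page else 0
    columns = [any(page[y][x] for y in range(top, bottom)) for x in range(width)]
    return _runs(columns)
```

Touching or broken characters defeat this kind of profile-based splitting, which is one way a segmentation error ends up degrading glyph identification.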
OCR
This term is used in two different contexts within Conjecture. It can mean "optical character recognition" (a verb), or it can mean "optical character recognizer" (a noun). The OCR class hierarchy uses the noun semantics: each subclass of OCR is an optical character recognizer (that performs optical character recognition :-). Optical character recognition involves taking an image as input and producing formatted text as output.
Synonym for 'implementation'.
vertical
The direction orthogonal to horizontal. It too is affected by page rotation issues.

