An Extensible Optical Character Recognition Framework

The Kernel Classes

The Conjecture framework defines a core collection of C++ classes that represent the fundamental objects that any OCR needs. The kernel classes consist primarily of Image and the Element hierarchy, but also include various other fundamental classes. New classes will be considered for inclusion in the kernel based on their general utility.

The Element Class Hierarchy

The Element hierarchy is at the absolute center of OCR processing. Instances of the classes in this hierarchy are used to describe the fundamental concepts that an OCR will operate on.

Class Summary

Root Abstract superclass of all classes in the Conjecture framework.
Env A class providing a public interface to other programs.
Image A collection of pixels. In-memory representation of a file in a known graphics format.
Element Abstract superclass of a collection of classes representing a part-whole decomposition of a graphic image into smaller and smaller semantic units.
Page An Image and associated meta-data representing an entire page of to-be-scanned data. May be sub-divided into Regions, Lines, Words and/or Glyphs, although only Glyphs are crucial.
Region A sub-region of a Page, usually used to represent the graphical area for a single column in a multi-column image. May be sub-divided into Lines, Words, and/or Glyphs. This class is not currently prioritized and may be deleted if deemed unnecessary.
Line A sub-region of a Page (or Region) consisting entirely of Glyphs. May be sub-divided into Words that contain Glyphs, but often contains only Glyphs.
Word A sub-region of a Line consisting entirely of Glyphs, separated from other Words within the Line by more horizontal space than
Glyph A sub-region of a Page (or Region or Line or Word) representing a single to-be-identified character.

The Distinction between the is-a and has-a Element hierarchies

Note that the Element class is an aggregation of one or more instances of itself. This design allows for a great deal of flexibility, and significant code reuse between Element and its subclasses. However, the design does mean that there is both an Element is-a hierarchy (the collection of classes shown above and how they are hierarchically related) and an Element has-a hierarchy (the collection of specific objects and how they are connected to one another). It is important to keep the distinction between the two hierarchies clear. To this end, Conjecture documentation refers to the Element class hierarchy (for the is-a relationship) and to the Element containment hierarchy (for the has-a relationship).

Element defines both a 'parent' field (pointing to the Element within which this Element is contained) and a 'parts' field (containing all the "child" elements). Note that in this context, the terms 'parent' and 'child' do NOT refer to the class hierarchy, but instead to the containment (object) hierarchy. For example, if we say that a Page contains two Regions, each of which contains 40 Lines, each of which contains a variety of Words, each of which contains a variety of Glyphs, we are talking about the hierarchy of objects, not the hierarchy of classes. At the class level, Pages, Regions, Lines, Words and Glyphs are all "equal" subclasses of Element, but at the object level, there is an inherent asymmetry due to the semantics of the classes; a Page can contain Regions or Lines or Words or Glyphs, but a Glyph cannot contain a Word or Line or Region or Page, a Line cannot contain a Region or Page, etc.
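The two fields described above can be sketched as follows. This is a minimal illustration, not the framework's actual declaration: only the names 'parent' and 'parts' come from the text, while the add() and depth() functions, the use of std::unique_ptr for ownership, and the omission of the Root superclass are assumptions made for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical sketch of Element; the real class derives from Root and
// will differ in detail.
class Element {
public:
    virtual ~Element() = default;

    // Places 'child' inside this Element (containment hierarchy) and
    // returns a non-owning pointer to it.
    Element* add(std::unique_ptr<Element> child) {
        assert(child && child->parent_ == nullptr);
        child->parent_ = this;
        parts_.push_back(std::move(child));
        return parts_.back().get();
    }

    Element* parent() const { return parent_; }
    const std::vector<std::unique_ptr<Element>>& parts() const { return parts_; }

    // Number of 'parent' links between this Element and the top-level one.
    std::size_t depth() const {
        std::size_t d = 0;
        for (const Element* e = parent_; e != nullptr; e = e->parent_) ++d;
        return d;
    }

private:
    Element* parent_ = nullptr;                    // the containing Element
    std::vector<std::unique_ptr<Element>> parts_;  // the contained "child" Elements
};

// In the class hierarchy these are all "equal" direct subclasses of Element.
class Page : public Element {};
class Region : public Element {};
class Line : public Element {};
class Word : public Element {};
class Glyph : public Element {};
```

In this sketch, a Glyph added to a Line that was itself added to a Page has a depth() of 2 in the containment hierarchy, even though Glyph, Line and Page sit at the same level of the class hierarchy.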

One way to enforce this asymmetry would have been to NOT define a 'parts' field in Element (containing Elements), but instead to have the Page class define a 'regions' field containing Region instances, have Region define a 'lines' field containing Line instances, have Line define a 'words' field containing Word instances, etc. Although this alternative allows for more specificity, and thus better compile-time type-checking, it is also very constraining because it forces every OCR to always divide each Page into Regions, each Region into Lines, each Line into Words, and each Word into Glyphs. Because Conjecture is attempting to be a universal framework that can support any OCR implementation imaginable, we do not want to place unnecessary restrictions on how an implementation performs its duties. All that is strictly needed is to divide Pages into Glyphs (the subdivision into Words, Lines and Regions is not strictly necessary).

So, instead of using the above strategy, Element defines a 'parts' field, which allows Pages to contain Regions, Lines, Words or Glyphs, allows Regions to contain Lines, Words, or Glyphs, and allows Lines to contain Words or Glyphs, which is more flexible. Of course, from a compile-time perspective this design is problematic because nonsensical containments are also possible (a Glyph could contain a Word or Line or Region or Page, a Line could contain a Region or Page, etc.). However, this is easily addressed with some run-time checks in the 'element-adding' functionality defined on the Element class. The increased flexibility of this approach was deemed worth the reduction in compile-time type-checking accuracy.
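One way such a run-time check might look is sketched below. The Kind enumeration, the canContain() rule and the exception type are all hypothetical and are not taken from the Conjecture sources. The rule assumed here is that a child must be a strictly smaller unit than its container, with the one exception that a Glyph may contain other Glyphs.

```cpp
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical ranking of Element kinds, from smallest to largest unit.
enum class Kind { Glyph = 0, Word = 1, Line = 2, Region = 3, Page = 4 };

class Element {
public:
    explicit Element(Kind kind) : kind_(kind) {}
    virtual ~Element() = default;

    Kind kind() const { return kind_; }
    Element* parent() const { return parent_; }
    const std::vector<std::unique_ptr<Element>>& parts() const { return parts_; }

    // True if an Element of kind 'child' may be placed inside one of kind
    // 'parent'. Assumed rule: the child must be a strictly smaller unit,
    // except that a Glyph may contain other Glyphs.
    static bool canContain(Kind parent, Kind child) {
        if (parent == Kind::Glyph) return child == Kind::Glyph;
        return static_cast<int>(child) < static_cast<int>(parent);
    }

    // The 'element-adding' entry point: nonsensical containments are
    // rejected at run time rather than by the type system.
    void add(std::unique_ptr<Element> child) {
        if (!child) throw std::invalid_argument("null child");
        if (!canContain(kind_, child->kind_))
            throw std::invalid_argument("invalid containment");
        child->parent_ = this;
        parts_.push_back(std::move(child));
    }

private:
    Kind kind_;
    Element* parent_ = nullptr;
    std::vector<std::unique_ptr<Element>> parts_;
};

class Page : public Element { public: Page() : Element(Kind::Page) {} };
class Region : public Element { public: Region() : Element(Kind::Region) {} };
class Line : public Element { public: Line() : Element(Kind::Line) {} };
class Word : public Element { public: Word() : Element(Kind::Word) {} };
class Glyph : public Element { public: Glyph() : Element(Kind::Glyph) {} };
```

With this sketch, adding a Glyph directly to a Page succeeds, while adding a Word to a Glyph throws std::invalid_argument. The price is the one described above: the mistake is caught only when the offending call executes, not when the program is compiled.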

How the Element Hierarchy interacts with Image

The overall idea behind Elements is a part-whole decomposition of an input image into smaller and smaller images. The manner in which this decomposition is implemented can have significant time and space efficiency ramifications.

One approach is to have each Element subclass maintain a local copy of that portion of the overall image to which it applies. Fields 'height' and 'width' would establish the pixel dimensions, and a 'data' field could store the actual pixel information. For example, each Glyph could maintain a copy of that portion of the Page represented by the Glyph.

The problem with the above naive implementation is that it consumes more and more memory the more sub-divided an Element becomes. Although it is common for Pages to contain just a collection of Glyphs, it is also possible for a Page to contain Regions that contain Lines that contain Words that contain Glyphs (which might contain other Glyphs!). In such a situation, the pixel data representing an individual Glyph would be copied (and maintained in memory) up to 5 times (stored in a Glyph, stored in a Word, stored in a Line, stored in a Region, and stored in the Page).

The above memory impact can be avoided by taking a different approach. Instead of each Element maintaining a separate copy of its portion of an image, we note that each input image corresponds to a Page, and thus a Page represents the largest image. In the containment hierarchy, all other subclasses of Element eventually "belong" to a Page. For this reason, if Page stores an Image instance, then every Element subclass has access to that "big picture" image by following its 'parent' field up until a Page is reached (which is guaranteed by the Conjecture framework to always occur). Each Element can be thought of as representing a specific rectangular region within that "big picture" image associated with the Page.

By having each Element store the top-left and bottom-right coordinates (relative to the "big picture" image), we can avoid requiring each Element to maintain its own copy of the pixel data. It is for this reason that the Element hierarchy does NOT inherit from Image, the reason the Element class itself does not define an Image field, and the reason the Page class has an Image field. This approach significantly reduces memory costs and thus improves efficiency.
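The shared-image approach can be sketched as follows. This is an illustration under stated assumptions rather than the framework's real interface: the Image struct is a stand-in (8-bit grayscale, row-major), the bottom-right coordinate is assumed to be inclusive, and the names image(), pixel(), width() and height() are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Minimal stand-in for the framework's Image class (assumed layout).
struct Image {
    int width = 0;
    int height = 0;
    std::vector<std::uint8_t> data;  // row-major, width * height pixels

    std::uint8_t at(int x, int y) const { return data[y * width + x]; }
};

class Element {
public:
    // Coordinates are relative to the Page's "big picture" image;
    // (right, bottom) is assumed to be inclusive.
    Element(Element* parent, int left, int top, int right, int bottom)
        : parent_(parent), left_(left), top_(top), right_(right), bottom_(bottom) {}
    virtual ~Element() = default;

    // Follows 'parent' links upward until a Page answers with its Image.
    virtual const Image& image() const {
        assert(parent_ != nullptr);  // every non-Page Element belongs to a Page
        return parent_->image();
    }

    // Reads a pixel addressed relative to this Element's top-left corner,
    // without this Element holding any pixel data of its own.
    std::uint8_t pixel(int dx, int dy) const {
        return image().at(left_ + dx, top_ + dy);
    }

    int width() const { return right_ - left_ + 1; }
    int height() const { return bottom_ - top_ + 1; }

private:
    Element* parent_;
    int left_, top_, right_, bottom_;
};

// Only Page owns an Image; its rectangle covers the whole image.
class Page : public Element {
public:
    explicit Page(Image img)
        : Element(nullptr, 0, 0, img.width - 1, img.height - 1),
          image_(std::move(img)) {}

    const Image& image() const override { return image_; }

private:
    Image image_;
};

class Glyph : public Element {
public:
    using Element::Element;  // same (parent, left, top, right, bottom) constructor
};
```

In this sketch each non-Page Element costs one pointer and four integers regardless of how many pixels it covers, and a Glyph nested several levels deep still reads from the single Image held by its Page.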

Conjecture is using services provided by SourceForge