An Extensible Optical Character Recognition Framework


A collection of things to do, divided into various categories. Some of these things will be trivial for you to help address - please contribute where you can.

Easy Issues (Low-hanging fruit)

Input Images
 - Convert $CONJECTURE/harness/db/*.pnm to max intensity 255.
   The 4x6.pnm, etc. files have max intensity 65535 and ocrad
   fails on them (poor dumb ocrad!).  We really don't need
   65535 anyway, so it isn't a problem to convert them.
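
   The rescaling itself is simple arithmetic. A minimal C++ sketch,
   assuming the samples have already been read into memory (the
   function names rescale_sample and rescale_to_8bit are hypothetical,
   not part of Conjecture); reading and writing the .pnm files is
   left out:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Map one sample from the range [0, old_max] to [0, new_max],
// rounding to the nearest integer.
inline std::uint32_t rescale_sample(std::uint32_t value,
                                    std::uint32_t old_max,
                                    std::uint32_t new_max) {
    assert(old_max > 0 && value <= old_max);
    // A 64-bit intermediate avoids overflow in value * new_max.
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(value) * new_max + old_max / 2) / old_max);
}

// Rescale every sample of an image that was stored with maxval old_max
// so that the result uses maxval 255.
inline std::vector<std::uint8_t> rescale_to_8bit(
        const std::vector<std::uint16_t>& samples, std::uint32_t old_max) {
    std::vector<std::uint8_t> out;
    out.reserve(samples.size());
    for (std::uint16_t s : samples) {
        out.push_back(static_cast<std::uint8_t>(rescale_sample(s, old_max, 255)));
    }
    return out;
}
```

   With old_max = 65535, a sample of 65535 maps to 255 and 32768 maps
   to 128. Netpbm's pamdepth tool performs this kind of maxval change
   from the command line, which may be simpler than writing code.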

HTML Pages
 - The current web page structure is using TABLE for structuring. We
   want to switch over to using DIV to make things more flexible.
   If you have experience with CSS, contact bruno and wade so we can
   collaborate on a redesign.

 - Related to the above, we are currently using BLOCKQUOTE for
   indentation within sections. Using CSS will be much better.
   We need a CSS-knowledgeable contributor!

 - Figure out why indentation of paragraphs after paragraphs isn't
   working properly. I know it is trivial in CSS, but am obviously
   doing something wrong. See the P P entry in style.css and tell
   me what to fix.

 - A new format for describing existing Components is needed. This
   is the old-style format that needs to be made more readable.

 - A new format for the TODO page (the page you are reading right now)
   must be designed. However, it should be very easy for people to add
   new entries, so requiring a complex HTML structure for each entry
   is not desirable. Some combination of PHP and CSS might make it
   look better and still make it easy for people to specify content
   without having to typeset.

 - How do we stop HTML from pushing the Quick Links sidebar further
   left when the page contains PRE or IMG elements that cannot wrap?
   Once the Quick Links sidebar is a DIV element, perhaps some sort of
   positioning relative to the right margin would work? But we want
   text to flow around it. I have some handy embedded-figure CSS code
   that I'll dig out - maybe that will work nicely.

High-level Documentation

 - If you notice typos or other kinds of errors in any documentation,
   please fix them! New users are the ones most likely to detect such
   errors, and also most in need of clear, concise, accurate
   explanations, so please help maintain the documentation. High level
   documentation is in $CONJECTURE/doc/web. Class and
   method-level documentation is in the header files in

 - Write a tutorial for new Modules, with a simple example that can
   be used as a guide for new developers.

Class Documentation
 - We need to insert a header in every source file. Presumably SVN has
   substitution variables like CVS does? Who knows what variables to

Compiler Warnings
 - [wade] On cygwin, use of 'std::list<...>::push_back()' generates
   compiler warnings of the following form:
      warning: '__p' might be used uninitialized in this function
   Anyone see the reason?  Line 435 of my cygwin platform's stl_list.h is
      _Node* __p = this->_M_get_node();
   which doesn't seem to be a candidate for such a warning (if anything,
   the field returned by '_M_get_node()' should be reported as
   potentially unused, since '__p' itself cannot be uninitialized unless
   the return of '_M_get_node' is uninitialized).

   Google reports a few others having this problem, but no solutions as far
   as I could tell.

Design Issues

 - Many more components must be identified and formalized!
 - Currently we only have the "broad strokes".
   ProcessComponent provides total control over the OCR
   implementation. SegmentComponent does Page-to-Glyph
   segmentation (and optionally intermediate segmentation).
   IdentifyComponent does Glyph-to-unicode translation.
   FormatComponent does text placement (words, lines,
   columns, etc.).

   ProcessComponent is almost always implemented by executing a
   SegmentComponent, then an IdentifyComponent, then a
   FormatComponent. That is, components can be composed of a number of
   sub-components (but don't have to be!).
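
   The composition described above could be sketched in C++ roughly as
   follows. Only the four component names come from the text; the Page
   and Glyph structs, the method names and signatures, PipelineProcess,
   and the toy stand-in components are all hypothetical and are not
   Conjecture's actual interfaces:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical placeholder types; Conjecture's real classes will differ.
struct Page  { std::string pixels; };
struct Glyph { char shape; };

// Page-to-Glyph segmentation.
struct SegmentComponent {
    virtual ~SegmentComponent() {}
    virtual std::vector<Glyph> segment(const Page& page) = 0;
};

// Glyph-to-unicode translation (one code point per glyph).
struct IdentifyComponent {
    virtual ~IdentifyComponent() {}
    virtual char32_t identify(const Glyph& glyph) = 0;
};

// Text placement (words, lines, columns, etc.).
struct FormatComponent {
    virtual ~FormatComponent() {}
    virtual std::u32string format(const std::vector<char32_t>& codes) = 0;
};

// Total control over the OCR implementation.
struct ProcessComponent {
    virtual ~ProcessComponent() {}
    virtual std::u32string process(const Page& page) = 0;
};

// The common case: segment, then identify, then format.
class PipelineProcess : public ProcessComponent {
public:
    PipelineProcess(SegmentComponent* s, IdentifyComponent* i, FormatComponent* f)
        : segment_(s), identify_(i), format_(f) {
        assert(s && i && f);
    }
    std::u32string process(const Page& page) override {
        std::vector<char32_t> codes;
        for (const Glyph& g : segment_->segment(page)) {
            codes.push_back(identify_->identify(g));
        }
        return format_->format(codes);
    }
private:
    SegmentComponent*  segment_;   // not owned
    IdentifyComponent* identify_;  // not owned
    FormatComponent*   format_;    // not owned
};

// Toy stand-ins, used only to exercise the pipeline.
struct CharSegmenter : SegmentComponent {
    std::vector<Glyph> segment(const Page& page) override {
        std::vector<Glyph> glyphs;
        for (char c : page.pixels) glyphs.push_back(Glyph{c});
        return glyphs;
    }
};
struct AsciiIdentifier : IdentifyComponent {
    char32_t identify(const Glyph& glyph) override {
        return static_cast<char32_t>(static_cast<unsigned char>(glyph.shape));
    }
};
struct PlainFormatter : FormatComponent {
    std::u32string format(const std::vector<char32_t>& codes) override {
        return std::u32string(codes.begin(), codes.end());
    }
};
```

   A ProcessComponent that does not want this decomposition can simply
   implement process() directly, which matches the "but don't have to
   be" remark above.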

   The next critical design issue wrt Components is to subdivide each
   of Segment, Identify and Format into sub-components. For example,
   there should be a DeskewComponent, NoiseRemovalComponent, and
   numerous other components within one or more of these "high-level"

   Identifying which issues should be formalized into Components is
   not always easy. For example, should thresholding and pixel
   filtering be separate components, or should they be part of a
   single component? Part of the fun of Conjecture is exploring the
   best design!

Specialized Modules

 - How do we go about representing specialized modules (those
   modules designed to only work for hand-written text, or only
   for numbers, or only for dot-matrix input, etc.)?

 - One possibility is to define subclasses of Module representing
   these specializations. But is this the best solution? Or should the
   individual Component hierarchies contain specialized
   abstract subclasses to deal with this issue?

Things To Do Related to Implementation

 - Must decide what to do about exceptions. Currently just throwing
   'string' instances, but a much cleaner solution involving a
   hierarchy of exception classes is needed eventually. Don't want to
   require Boost until it becomes the standard.
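
   One possible shape for such a hierarchy, using only the standard
   library (no Boost). This is a sketch: the class names
   ConjectureError and ImageError and the helper check_maxval are
   hypothetical and do not exist in the codebase:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Root of a hypothetical Conjecture exception hierarchy. Deriving from
// std::runtime_error keeps what() working for generic handlers.
class ConjectureError : public std::runtime_error {
public:
    explicit ConjectureError(const std::string& what)
        : std::runtime_error(what) {}
};

// Problems reading or decoding an input image.
class ImageError : public ConjectureError {
public:
    ImageError(const std::string& file, const std::string& reason)
        : ConjectureError(file + ": " + reason), file_(file) {}
    const std::string& file() const { return file_; }
private:
    std::string file_;
};

// Example of throwing a typed exception instead of a bare string.
// The 255 limit is only an illustration.
inline void check_maxval(const std::string& file, unsigned maxval) {
    assert(!file.empty());
    if (maxval > 255) {
        throw ImageError(file, "max intensity above 255 is not supported");
    }
}
```

   Callers can then catch ConjectureError (or std::exception) for any
   failure, or a subclass such as ImageError when the specific cause
   matters.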

Naming Issues

 - Find better names for certain classes:
    - The 'Region' class is a candidate for renaming.
    - The 'Env'    class is a candidate for renaming.

GOCR Documentation
 - The Gocr Module provides an interface to the GOCR codebase.
   However, although gocr can provide good results, its codebase is
   very poorly documented.
    - Provide an explanation of what each command-line flag does
    - Provide an overview discussion of the flow-of-control in
      gocr and the important methods and files.
    - Provide a detailed discussion of the critical methods
      like ocr0_eE(), etc.
    - All of this documentation should be placed in the
      src/modules/gocr directory (either in source files or other
      discussion files).

Things To Do Related to Infrastructure

 - Determine whether we are staying with the name Conjecture, or
   whether we will be changing the name to something else.

 - During initial development bruno and wade explored various names,
   the favored being Apocryphal and Sorcerer. However, due to the
   negative (if amusing) connotations of Apocryphal, and existing
   image analysis software named Sorcerer, we decided to go with
   Conjecture as our "temporary" name.

 - A name that has the letters 'o', 'c', and 'r' in it seems
   appropriate.  The following one-liner:
       % perl -lne 'next if substr($_,0,6) eq substr($last,0,6);
                   if ( m/(.*)o(.*)c(.*)r(.*)/ ) { 
                      printf("%-20s = %s\n", $_, "$1O$2C$3R$4");
                      $last = $_;
                    }' /usr/share/dict/words
   produces a candidate set of possible names.  Editing it
   based on double entendres wrt OCR or amusement value gives the
   following possibilities:
        apocryphal           = apOCRyphal
        aristocracy          = aristOCRacy
        autoincrement        = autOinCRement
        binocular            = binOCulaR
        boxcar               = bOxCaR
        cockroach            = cOCkRoach
        concentrate          = cOnCentRate
        concord              = cOnCoRd
        concur               = cOnCuR
        confectionery        = cOnfeCtioneRy
        conjecture           = cOnjeCtuRe
        consecrate           = cOnseCRate
        constructor          = cOnstruCtoR
        democracy            = demOCRacy
        doctor               = dOCtoR
        hypocrisy            = hypOCRisy
        idiosyncrasies       = idiOsynCRasies
        jockstrap            = jOCkstRap
        mediocre             = mediOCRe
        molecular            = mOleCulaR
        nomenclature         = nOmenClatuRe
        obfuscatory          = ObfusCatoRy
        obscure              = ObsCuRe
        octahedra            = OCtahedRa
        omicron              = OmiCRon
        orchard              = OrChaRd
        orchestra            = OrChestRa
        popcorn              = pOpCoRn
        postscript           = pOstsCRipt
        processor            = prOCessoR
        rhinoceros           = rhinOCeRos
        scorcher             = scOrCheR
        scorecard            = scOreCaRd
        soccer               = sOcCeR
        Socrates             = SOCRates
        sorcerer             = sOrCereR
        Velociraptor         = VelOCiraptoR
        woodpecker           = woOdpeCkeR

Web Pages

 - The web pages always need more work. If you are reading
   documentation for the first time, and something isn't making sense,
   make a note of it! When it starts making sense, go back and edit
   the documentation to make things more clear!

 - Set up a proper installation environment
    - autoconf (assigned to Bruno)
    - maintain the $CONJECTURE/external/Makefile,
      which should automatically download/compile/install third-party
      source code useful to Conjecture.

Platform Support

 - Verify that code compiles and works on common platforms

    - I have been testing on linux and cygwin machines

    - The stupid case-insensitive file directory convention
      on Windows has posed problems on more than one occasion.

 - Decide when to move the codebase from
   the two temporary mirrors
   to someplace more formal.

 - Do we want to go to sourceforge?  Bruno says sourceforge has
   SVN - that's good!

Test Harness
 - The test harness is meant to be an (ever expanding) collection
   of images and expected output, and an infrastructure for testing
   the output of 'conjecture' against that expected output.

 - Extend the test harness

    - Obtain new images and add them to the harness/db directory
      (for now, let's keep them relatively small, and in .pnm

    - Create 'harness/db/<name>.valid' file by typing in the
      text that a 100% accurate OCR would produce. If you can automate
      this process, please let us know :-) :-) :-) hee hee.

    - Create harness/<module>/<variant>/<name>.val using
      ocrtest -u.

    - Modify harness/Makefile (or generalize the makefile to look
      for all .pnm files!)

 - Since the image database can end up growing arbitrarily large, we
   will need to sub-divide it into self-contained "image bundles".
   Some bundles can be specific to a certain kind of image
   (handwritten, large font, numbers only, etc.), while other bundles
   will contain many different kinds of images.

    - Not all of these bundles will be placed in the SVN
      repository. Instead, a "code" bundle will be in the repository,
      and additional bundles can be downloaded from the website as
      desired. Makefile and script support for dealing with
      alternative databases can be added (i.e. add a -D flag to
      ocrtest to specify which non-standard image bundle
      to use during testing).
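
   The text above does not say how output is compared against the
   expected .valid text. One plausible measure, sketched here as an
   assumption rather than a description of what ocrtest does, is a
   character-level accuracy based on Levenshtein edit distance
   (edit_distance and accuracy are hypothetical names):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance: the minimum number of single-character
// insertions, deletions and substitutions turning actual into expected.
inline std::size_t edit_distance(const std::string& actual,
                                 const std::string& expected) {
    std::vector<std::size_t> prev(expected.size() + 1);
    std::vector<std::size_t> cur(expected.size() + 1);
    for (std::size_t j = 0; j <= expected.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= actual.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= expected.size(); ++j) {
            const std::size_t subst =
                prev[j - 1] + (actual[i - 1] == expected[j - 1] ? 0 : 1);
            cur[j] = std::min({subst, prev[j] + 1, cur[j - 1] + 1});
        }
        prev.swap(cur);  // the row just computed becomes the previous row
    }
    return prev[expected.size()];
}

// Score in [0, 1]; 1.0 means the OCR output matches the expected text.
inline double accuracy(const std::string& actual, const std::string& expected) {
    const std::size_t longest = std::max(actual.size(), expected.size());
    if (longest == 0) return 1.0;  // both empty: trivially a perfect match
    const std::size_t d = edit_distance(actual, expected);
    assert(d <= longest);  // the distance never exceeds the longer length
    return 1.0 - static_cast<double>(d) / static_cast<double>(longest);
}
```

   A per-image score like this could be averaged over an image bundle
   to compare modules or variants, though exact-match testing may be
   all the harness needs at first.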

Feature Wishlist

This is a place to jot down ideas for desired features. However, once they have been explored more thoroughly, they will be moved into the Design/Implementation/Infrastructure sections.


Conjecture is using services provided by SourceForge