An Extensible Optical Character Recognition Framework


This document discusses how to use the conjecture executable and the numerous supporting scripts provided by the Conjecture framework.


This discussion assumes that you have already installed Conjecture, have added $CONJECTURE/install/bin to your PATH, and have added $CONJECTURE/install/lib/perl to your PERLLIB. Let's perform some simple checks; the following commands verify that you have set your PATH properly. If you get errors, or if the paths reported aren't $CONJECTURE/install/bin/*, there is a problem.

   prompt% cd $CONJECTURE
   prompt% which conjecture
   prompt% which ocrdiff
   prompt% which ocrfind
   prompt% which ocrgen
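
The same check can be scripted instead of being read off by eye. The following is only a minimal sketch, assuming a POSIX shell such as sh or bash; the function name check_tool_dir is hypothetical and is not something Conjecture provides. It succeeds only when a command resolves to a file inside the expected directory.

```shell
# Hypothetical helper (not part of Conjecture): succeed only if the
# command named by $1 resolves to a file inside directory $2.
check_tool_dir() {
    tool=$1
    expected_dir=$2
    # command -v prints the path the shell would execute for this name.
    resolved=$(command -v "$tool") || {
        echo "$tool: not found on PATH" >&2
        return 1
    }
    case "$resolved" in
        "$expected_dir"/*) return 0 ;;
        *)
            echo "$tool: found at $resolved, expected under $expected_dir" >&2
            return 1
            ;;
    esac
}
```

With the function defined, a loop such as the following would cover all four commands:

   for tool in conjecture ocrdiff ocrfind ocrgen; do check_tool_dir "$tool" "$CONJECTURE/install/bin"; done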

The following checks whether you've set your PERLLIB correctly. It will take up to a minute to finish (but once it has, ocrfind will thereafter be very efficient). Here, we are just using it to test PERLLIB. We'll talk more about its purpose later. If you get an error when you execute this script, it probably means that your PERLLIB does not include $CONJECTURE/install/lib/perl.

   prompt% ocrfind -u
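
If you would rather inspect PERLLIB directly than wait for ocrfind, something like the sketch below could be used. It assumes a POSIX shell and a colon-separated PERLLIB; the function name perllib_has is hypothetical. It does an exact string comparison, so a trailing slash or a symlinked path would not match.

```shell
# Hypothetical helper (not part of Conjecture): succeed if directory $1
# is one of the colon-separated entries of PERLLIB.
perllib_has() {
    # Wrapping both sides in colons lets one pattern match the first,
    # middle, and last entries alike.
    case ":${PERLLIB:-}:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example:

   perllib_has "$CONJECTURE/install/lib/perl" || echo "PERLLIB does not include the Conjecture perl directory"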

The following changes to the directory holding the database of available images (it is currently very small). We (you :-) will be growing this database over time. If you have images of interest, please feel free to submit them (with some caveats).

   prompt% cd harness/db

The following assumes you have the display program on your machine. Any program that can display .pnm files will work just as well, of course (xv, gimp, etc.). The rod.pnm file is my current preferred test image (for D&D enthusiasts it might be recognizable), simply because it is the first image I tried to process when I looked at gocr.

   prompt% display rod.pnm &

The following commands invoke the conjecture executable, specifying rod.pnm as the input image file and rod-default.ocr as the output text file, and then display the result. (NOTE: Conjecture usually uses the .ocr suffix to represent text output from an execution of conjecture.)

   prompt% conjecture -i rod.pnm -o rod-default.ocr
   prompt% cat rod-default.ocr

The output produced by the above is:

   rejDui,lding rhe ċllu,mian cullectiun af iore uncil ic again
   macches che shelve5 nfThe Library aftke Sublime.
   AdveJ_tu_e Hook: A ja_tian amang tke githyanXi is
   rumored tu hn_e reassembled a workine copy oîrhe Rinlal

Now, visually compare the above output with the desired output:

   rebuilding the illumian collection of lore until it again
   matches the shelves of the Library of the Sublime.
     Adventure Hook: A faction among the githyanki is
   rumored to have reassembled a working copy of the Ritual

Note that the OCR is certainly not perfect, but it does manage to recognize a fair number of characters. Note also that it isn't easy to tell at a glance where the differences are (we will address this issue below).

The following also invokes conjecture, but explicitly specifies which OCR Module to use (in this case, the gocr module). Note that Conjecture started out being a restructuring of gocr, but quickly evolved into a much more general framework. Although gocr is currently the basis for all of the functional code, this is expected to change relatively soon, with other third-party OCRs (and personal OCR contributions) coming to hold equal weight with gocr.

   prompt% conjecture -M gocr -i rod.pnm -o rod-gocr.ocr

The following shows that not specifying a -M flag appears to produce the same output as using -M gocr.

   prompt% diff rod-default.ocr rod-gocr.ocr

Although the above would make immediate sense if the default value for -M were gocr, that isn't actually the case: the default value is default. However, the DefaultModule currently inherits from GocrModule, does not redefine anything, and thus produces the same output as gocr. This will change as Conjecture evolves and the best component implementations are identified (the default module is meant to be the "best" overall OCR we know how to make at any given time).

The following uses the ocrad module. Note that this may not succeed if ocrad failed to compile on your architecture (support for ocrad has not been tested very thoroughly yet).

   prompt% conjecture -M ocrad -i rod.pnm -o rod-ocrad.ocr
   prompt% cat rod-ocrad.ocr

The output generated by the ocrad module for this input file is:

   T_b_lldLng _he LIL_rman tollPt_lon o_ lo__ unTll |_ ag_In
   m__ch<_ _he __el_e_ o__hc Llb_ary oF[hP Subllm_
   Ad_c7lu_e Hook. A _a__Lon among lhe g__hy_nhl |_
   __mo_Pd _o há_e Tea__crnble_ a woTkl_g copy oF[hc RL_Llal

The mechanism that Conjecture currently uses to provide access to ocrad is quite different from the mechanism used to provide gocr. In particular, the gocr library is compiled directly into the conjecture executable, and its codebase is available for programmers to explore, modify and borrow as desired. The ocrad OCR, on the other hand, is currently supported by using a sub-process that invokes ocrad. As such, Conjecture does not have any access to its internals, and programmers cannot explore, modify and borrow from it. This is temporary (Conjecture will be providing direct access to ocrad when we (you) add such support), but in the meantime it is a useful means of demonstrating how flexible Conjecture is in what kinds of external OCRs it can support.

I was surprised to discover that most open-source OCR projects appear to lack facilities for assessing the accuracy of their codebase [wmh: is this true?]. I find this rather puzzling, as it seems like a rather important feature for developers wanting to assess the relative merits of different strategies for solving OCR-related issues.

In any event, Conjecture provides the ever-so-handy ocrdiff script. It accepts two files as arguments, the first one being the output generated by an OCR, and the second one being the "canonical" output (the output that an OCR would produce if it were 100% accurate). For example, we can establish the accuracy of the gocr module on the rod.pnm input file as follows:

   prompt% ocrdiff -F rod rod-gocr.ocr rod.valid

Note the use of rod.valid in the above command. It is assumed you are still in the harness/db directory. In addition to maintaining a collection of images useful for testing purposes, the harness/db directory also maintains "canonical" text files for each image file. These files have a .valid suffix.

The above ocrdiff command produces the following on stderr:

   Writing to tmp.{od,od+,od-}
   Results Summary:
      OCR  File : rod-gocr.ocr    (  218 chars,  4 lines)
      Goal File : rod.valid       (  217 chars,  4 lines)
      Missing   :     4
      Spurious  :     5
      Mismatches:    33
      All Errors:    42
      Accuracy  :    80.6%        (1-(42/217))*100

From this output, we can see that the gocr module has an overall accuracy of 80.6% for the input image in question. It missed segmenting 4 glyphs (possibly because segmentation identified two glyphs as one in 4 situations), generated 5 excess glyphs (possibly because it split what should have been one glyph into two glyphs), and incorrectly identified 33 glyphs (out of 217).
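
The accuracy figure follows directly from the formula printed at the end of the summary. As a minimal sketch of that arithmetic (assuming a POSIX shell with awk; the function name ocr_accuracy is hypothetical and is not part of ocrdiff), the three error counts are summed and divided by the number of characters in the goal file:

```shell
# Hypothetical helper (not part of Conjecture): print the accuracy
# percentage from the counts shown in an ocrdiff summary.
# Usage: ocr_accuracy MISSING SPURIOUS MISMATCHES GOAL_CHARS
ocr_accuracy() {
    awk -v missing="$1" -v spurious="$2" -v mismatches="$3" -v goal="$4" '
        BEGIN {
            # All Errors = Missing + Spurious + Mismatches
            errors = missing + spurious + mismatches
            # Accuracy = (1 - (errors / goal chars)) * 100
            printf "%.1f\n", (1 - errors / goal) * 100
        }'
}
```

Running ocr_accuracy 4 5 33 217 prints 80.6, matching the summary above. The sketch does not guard against a goal file with zero characters.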

In addition to the above output, the ocrdiff script creates three files. The full details can be found by executing

   prompt% ocrdiff -h

so here we will only present a very brief discussion of the most useful of these generated files, tmp.od. This file provides a sophisticated character-by-character difference summary, and although the output might at first appear somewhat unintuitive, it ends up being a very useful way to understand where the OCR is failing and where it is succeeding.

   ---------- Line   1/  1 ------------------------------------------
      |rejDui,lding rhe D%llu,mian cullectiun af iore uncil ic again
      |rejDui,lding rhe D%llu,mian cullectiun af iore uncil ic again
    13|  jD  ,      r   D%   ,      u      u  a  i      c    c      
     9|  $b  $      t   $i   $      o      o  o  l      t    t      
     4|re$bui$lding the $illu$mian collection of lore until it again
      |rebuilding the illumian collection of lore until it again
   ---------- Line   2/  2 ------------------------------------------
      |macches che shelve5 nfThe Library aftke Sublime.
     2|macches che shelve5 nf$The Library af$tke Sublime.
     9|  c     c         5 n $T           a $ k          
     7|  t     t         s o  t           o   h          
      |matches the shelves of the Library of the Sublime.
      |matches the shelves of the Library of the Sublime.
   ---------- Line   3/  3 ------------------------------------------
      |AdveJ_tu_e Hook: A ja_tian amang tke githyanXi is
     2|$$AdveJ_tu_e Hook: A ja_tian amang tke githyanXi is
    11|$$    J_  _          j _  a    a    k         X    
     8|      $n  r          f c  o    o    h         k    
     1|  Adve$nture Hook: A faction among the githyanki is
      |  Adventure Hook: A faction among the githyanki is
   ---------- Line   4/  4 ------------------------------------------
      |rumored tu hn_e reassembled a workine copy oC.rhe Rinlal
      |rumored tu hn_e reassembled a workine copy oC.rhe Rinlal
     9|         u  n_                      e       C.r     nl  
     9|         o  av                      g       f t     tu  
      |rumored to have reassembled a working copy of the Ritual
      |rumored to have reassembled a working copy of the Ritual

Each line in the two input files generates 6 lines in this output file (plus an initial header that displays the corresponding line numbers within the two files). Details on the format are again available via ocrdiff -h. For now, suffice it to say that lines 4 and 5 of each 7-line "record" show the set of characters that failed (line 4 shows the OCR predictions, line 5 shows the correct characters). In these two lines of output, only characters that differ are shown, which allows us to see at a glance where the OCR is failing.

For example, let's look at the second record in the above output (the record of 7 lines starting with a line containing Line 2/ 2). The first line after the header shows what the gocr OCR generated. The last line in the record (just before the ------..) shows what should have been generated. The second and second-last lines show the input strings after special $ characters have been inserted to maximize the amount of character-to-character correspondence. This process is critical to the proper functioning of ocrdiff (and is discussed in more detail in the output of ocrdiff -h). The numbers to the left of these lines indicate how many $ characters were inserted, and correspond respectively to the number of characters the OCR missed (didn't segment as glyphs) and the number of characters the OCR mistakenly identified as being two characters.

The middle two lines, as already mentioned, show only those characters that differ, and analyzing these lines can give some useful insight into what is going wrong. Note that the first mismatch is c vs. t. For the given input image, these two characters look surprisingly similar, so it isn't surprising that gocr is having difficulties with them. The same is true of all of the other mismatches on the line. Note that the existence of $ characters also indicates that gocr failed to identify two glyphs in this line.
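
To make the idea behind the middle two lines concrete, here is a small sketch of how such lines could be derived once two strings have been aligned to the same length. This is not ocrdiff's actual code; the function name show_mismatches is hypothetical, and it assumes a POSIX shell with awk and two inputs of equal length.

```shell
# Hypothetical helper (not ocrdiff's actual code): given two aligned
# strings of equal length, print two lines that keep only the positions
# where the strings differ and blank out every matching position.
show_mismatches() {
    # Pass the strings through the environment so awk does not
    # interpret backslashes in them as escape sequences.
    OCR_TEXT="$1" GOAL_TEXT="$2" awk '
        BEGIN {
            ocr = ENVIRON["OCR_TEXT"]
            goal = ENVIRON["GOAL_TEXT"]
            n = length(ocr)
            for (i = 1; i <= n; i++) {
                a = substr(ocr, i, 1)
                b = substr(goal, i, 1)
                if (a == b) { a = " "; b = " " }
                top = top a
                bottom = bottom b
            }
            print top      # what the OCR predicted
            print bottom   # what it should have been
        }'
}
```

For instance, show_mismatches "macches che" "matches the" prints a line containing only the two c characters, followed by a line containing only the two t characters in the same columns.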

Having briefly explained what ocrdiff does, let's now use it to assess the accuracy of the ocrad module. The -F flag specifies what file prefix to use when generating the .od file(s).

   prompt% ocrdiff -F rod-ocrad rod-ocrad.ocr rod.valid

which produces the following on stderr:

   Writing to rod-ocrad.{od,od+,od-}
   Results Summary:
      OCR  File : rod-ocrad.ocr   (  214 chars,  5 lines)
      Goal File : rod.valid       (  217 chars,  4 lines)
      Missing   :     6
      Spurious  :     2
      Mismatches:    84
      All Errors:    92
      Accuracy  :    57.6%        (1-(92/217))*100

From the above, we can see that the ocrad module gives much poorer results than gocr for this particular input image (using the default parameter values for each module). However, keep in mind that the input image, and the parameters sent to each OCR, can have a huge effect on overall accuracy.
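
When comparing several modules on the same image, the conjecture and ocrdiff steps shown above can be wrapped in a loop. The following is a sketch only: the function name compare_modules is hypothetical, it assumes a POSIX shell, and it assumes that conjecture exits with a non-zero status when a module fails (this document does not confirm that). It uses only the -M, -i, -o and -F flags already introduced.

```shell
# Hypothetical helper (not part of Conjecture): run several OCR modules
# on one image and score each result against the matching .valid file.
# Usage: compare_modules IMAGE_STEM MODULE...
compare_modules() {
    stem=$1
    shift
    for module in "$@"; do
        result="$stem-$module.ocr"
        # Assumption: conjecture returns non-zero if the module fails.
        if conjecture -M "$module" -i "$stem.pnm" -o "$result"; then
            ocrdiff -F "$stem-$module" "$result" "$stem.valid"
        else
            echo "module $module failed on $stem.pnm, skipping" >&2
        fi
    done
}
```

From the harness/db directory, compare_modules rod gocr ocrad would repeat the two comparisons made in this document.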

The preceding discussion provides a very brief overview of how to get started with conjecture. However, there are many command-line flags not yet discussed, and numerous other useful scripts provided by Conjecture. Of these scripts, one in particular is very useful for individuals wanting to gain familiarity with the Conjecture infrastructure and codebase. If you want to search for an arbitrary string among all possible places it might reside, you can use ocrfind.

   prompt% ocrfind -u

The above creates a database of files that ocrfind will use in searches. It takes a while, but only needs to be generated once a day (or so). The command

   prompt% ocrfind -h

gives more details on how to use this script. Keep it in mind, because it can be very useful if you are planning on contributing to Conjecture in any way.

