This discussion assumes that you have already installed Conjecture,
have added $CONJECTURE/install/bin to your PATH, and have added
$CONJECTURE/install/lib/perl to your PERLLIB. Let's perform some simple
checks; the following commands ensure that you have set your PATH
properly. If you get errors, or if the paths reported aren't
$CONJECTURE/install/bin/*, there is a problem.
prompt% cd $CONJECTURE
prompt% which conjecture
prompt% which ocrdiff
prompt% which ocrfind
prompt% which ocrgen
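The PATH check above can also be scripted. The following is a minimal sketch, assuming a POSIX shell; the function name check_tool is hypothetical and is not part of Conjecture.

```shell
#!/bin/sh
# Hypothetical helper (not part of Conjecture): report whether a tool
# resolves to a file inside the expected directory on the current PATH.
check_tool() {
    tool=$1
    expected_dir=$2
    # command -v prints the resolved path, or fails if nothing is found.
    found=$(command -v "$tool") || { echo "$tool: not found"; return 1; }
    case $found in
        "$expected_dir"/*) echo "$tool: ok ($found)" ;;
        *) echo "$tool: unexpected location ($found)"; return 1 ;;
    esac
}
```

For example, check_tool conjecture "$CONJECTURE/install/bin" should report "ok" if your PATH is set as described; repeat it for ocrdiff, ocrfind and ocrgen.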
The following checks whether you've set your PERLLIB correctly. It will
take up to a minute to finish (but once it has, ocrfind will thereafter
be very efficient). Here, we are just using it to test PERLLIB; we'll
talk more about its purpose later. If you get an error when you execute
this script, it probably means that your PERLLIB does not include
$CONJECTURE/install/lib/perl.
prompt% ocrfind -u
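If ocrfind fails, it can help to confirm directly whether the directory is listed in PERLLIB. This is a small sketch, assuming a POSIX shell and a colon-separated PERLLIB; the function name path_list_contains is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of Conjecture): succeed if the directory
# in $1 is one of the colon-separated entries of the list in $2.
path_list_contains() {
    # Wrapping both sides in colons lets one pattern match the first,
    # middle, and last entries alike.
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: path_list_contains "$CONJECTURE/install/lib/perl" "$PERLLIB" || echo "PERLLIB is missing the Conjecture directory".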
The following shows the database of images available (it is
currently very small). We (you :-) will be growing this database
over time. If you have images of interest, please feel free to
submit them (with some caveats).
prompt% cd harness/db
The following assumes you have the display program on your machine.
Any program that can display .pnm files will work just as well, of
course (xv, gimp, etc.). The rod.pnm file is my current preferred test
image (for D&D enthusiasts it might be recognizable), simply because
it is the first image I tried to process when I looked at gocr.
prompt% display rod.pnm &
The following command invokes the conjecture executable, specifying
rod.pnm as the input image file and rod-default.ocr as the output text
file. (NOTE: Conjecture usually uses the .ocr suffix to represent text
output from an execution of conjecture.)
prompt% conjecture -i rod.pnm -o rod-default.ocr
prompt% cat rod-default.ocr
The output produced by the above is:
rejDui,lding rhe Ällu,mian cullectiun af iore uncil ic again
macches che shelve5 nfThe Library aftke Sublime.
AdveJ_tu_e Hook: A ja_tian amang tke githyanXi is
rumored tu hn_e reassembled a workine copy oîrhe Rinlal
Now, visually compare the above output with the desired output:
rebuilding the illumian collection of lore until it again
matches the shelves of the Library of the Sublime.
Adventure Hook: A faction among the githyanki is
rumored to have reassembled a working copy of the Ritual
Note that the OCR is certainly not perfect, but it does manage to
recognize a fair number of characters. Note also that it isn't easy
to tell at a glance where the differences are (we will address this
issue below).
The following also invokes conjecture, but explicitly specifies which
OCR Module to use (in this case, the gocr module). Note that
Conjecture started out being a restructuring of gocr, but quickly
evolved into a much more general framework. Although gocr is
currently the basis for all of the functional code, this is expected
to change relatively soon, with other third-party OCRs (and personal
OCR contributions) coming to hold equal weight with gocr.
prompt% conjecture -M gocr -i rod.pnm -o rod-gocr.ocr
The following shows that not specifying a -M flag appears to produce
the same output as using -M gocr.
prompt% diff rod-default.ocr rod-gocr.ocr
Although the above would make immediate sense if the default value for
-M were gocr, that isn't actually the case: the default value is
default. However, the DefaultModule currently inherits from
GocrModule, does not redefine anything, and thus produces the same
output as gocr. This will change as Conjecture evolves and the best
component implementations are identified (the default module is meant
to be the "best" overall OCR we know how to make at any given time).
The following uses the ocrad module. Note that this may not succeed if
ocrad failed to compile on your architecture (support for ocrad has
not been tested very thoroughly yet).
prompt% conjecture -M ocrad -i rod.pnm -o rod-ocrad.ocr
prompt% cat rod-ocrad.ocr
The output generated by the ocrad module for this input file is:
T_b_lldLng _he LIL_rman tollPt_lon o_ lo__ unTll |_ ag_In
m__ch<_ _he __el_e_ o__hc Llb_ary oF[hP Subllm_
Ad_c7lu_e Hook. A _a__Lon among lhe g__hy_nhl |_
__mo_Pd _o há_e Tea__crnble_ a woTkl_g copy oF[hc RL_Llal
The mechanism that Conjecture currently uses to provide access to
ocrad is quite different from the mechanism used to provide gocr. In
particular, the gocr library is compiled directly into the conjecture
executable, and its codebase is available for programmers to explore,
modify and borrow as desired. The ocrad OCR, on the other hand, is
currently supported by a sub-process that invokes ocrad. As such,
Conjecture does not have any access to its internals, and programmers
cannot explore, modify and borrow from it. This is temporary
(Conjecture will provide direct access to ocrad when we (you) add such
support), but in the meantime it is a useful demonstration of how
flexible Conjecture is in the kinds of external OCRs it can support.
I was surprised to discover that most open-source OCR projects
appear to be lacking facilities for assessing the accuracy of their
codebase [wmh: is this true?]. I find this rather puzzling, as it
seems like a rather important feature for developers wanting to
assess the relative merits of different strategies for solving
OCR-related issues.
In any event, Conjecture provides the ever-so-handy ocrdiff script. It
accepts two files as arguments, the first one being the output
generated by an OCR, and the second one being the "canonical" output
(the output that an OCR would produce if it were 100% accurate). For
example, we can establish the accuracy of the gocr module on the
rod.pnm input file as follows:
prompt% ocrdiff rod-gocr.ocr rod.valid
Note the use of rod.valid in the above command. It is assumed you are
still in the harness/db directory. In addition to maintaining a
collection of images useful for testing purposes, the harness/db
directory also maintains "canonical" text files for each image file.
These files have a .valid suffix.
The above ocrdiff command produces the following on stderr:
Writing to tmp.{od,od+,od-}
Results Summary:
OCR File : rod-gocr.ocr ( 218 chars, 4 lines)
Goal File : rod.valid ( 217 chars, 4 lines)
Missing : 4
Spurious : 5
Mismatches: 33
All Errors: 42
Accuracy : 80.6% (1-(42/217))*100
From this output, we can see that the gocr module has an overall
accuracy of 80.6% for the input image in question. It missed
segmenting 4 glyphs (possibly because segmentation identified two
glyphs as one in 4 situations), generated 5 excess glyphs (possibly
because it split what should have been one glyph into two glyphs), and
incorrectly identified 33 glyphs (out of 217).
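The accuracy figure can be recomputed by hand from the counts in the summary, using the formula printed on its last line: accuracy = (1 - errors/goal_chars) * 100, where errors is the sum of missing, spurious and mismatched characters. A minimal sketch, assuming a POSIX shell with awk; the function name ocr_accuracy is hypothetical and ocrdiff's own code may differ.

```shell
#!/bin/sh
# Hypothetical helper: recompute the summary's accuracy figure.
# Arguments: missing, spurious, mismatches, characters in the goal file.
ocr_accuracy() {
    missing=$1 spurious=$2 mismatches=$3 goal_chars=$4
    awk -v m="$missing" -v s="$spurious" -v x="$mismatches" -v n="$goal_chars" \
        'BEGIN { printf "%d errors, %.1f%%\n", m + s + x, (1 - (m + s + x) / n) * 100 }'
}

ocr_accuracy 4 5 33 217   # prints "42 errors, 80.6%" (the gocr counts above)
```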
In addition to the above output, the ocrdiff script creates three
files. The full details can be found by executing
prompt% ocrdiff -h
so here we will only present a very brief discussion of the most
useful of these generated files, tmp.od. This file provides a
sophisticated character-by-character difference summary, and although
the output might at first appear somewhat unintuitive, it ends up
being a very useful way to understand where the OCR is failing, and
where it is succeeding.
---------- Line 1/ 1 ------------------------------------------
   |rejDui,lding rhe D%llu,mian cullectiun af iore uncil ic again
   |rejDui,lding rhe D%llu,mian cullectiun af iore uncil ic again
 13|  jD  ,      r   D%   ,      u      u  a  i      c    c
  9|  $b  $      t   $i   $      o      o  o  l      t    t
  4|re$bui$lding the $illu$mian collection of lore until it again
   |rebuilding the illumian collection of lore until it again
---------- Line 2/ 2 ------------------------------------------
   |macches che shelve5 nfThe Library aftke Sublime.
  2|macches che shelve5 nf$The Library af$tke Sublime.
  9|  c     c         5 n $T           a $ k
  7|  t     t         s o  t           o   h
   |matches the shelves of the Library of the Sublime.
   |matches the shelves of the Library of the Sublime.
---------- Line 3/ 3 ------------------------------------------
   |AdveJ_tu_e Hook: A ja_tian amang tke githyanXi is
  2|$$AdveJ_tu_e Hook: A ja_tian amang tke githyanXi is
 11|$$    J_  _          j _  a    a    k         X
  8|      $n  r          f c  o    o    h         k
  1|  Adve$nture Hook: A faction among the githyanki is
   |  Adventure Hook: A faction among the githyanki is
---------- Line 4/ 4 ------------------------------------------
   |rumored tu hn_e reassembled a workine copy oC.rhe Rinlal
   |rumored tu hn_e reassembled a workine copy oC.rhe Rinlal
  9|         u  n_                      e       C.r     nl
  9|         o  av                      g       f t     tu
   |rumored to have reassembled a working copy of the Ritual
   |rumored to have reassembled a working copy of the Ritual
Each line in the two input files generates 6 lines in this output
file (plus an initial header that displays the corresponding line
numbers within the two files). Details on the format are again
available via ocrdiff -h. For now, suffice it to say that lines 4 and
5 of each 7-line "record" show the set of characters that failed (line
4 shows the OCR predictions, line 5 shows the correct characters). In
these two lines of output, only characters that differ are shown,
which allows us to see at a glance where the OCR is failing.
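To make the idea behind lines 4 and 5 concrete, here is a minimal sketch that produces such a "differences only" line from two already-aligned strings of equal length. It assumes a POSIX shell with awk and ASCII input; the function name show_diffs is hypothetical, and this is not ocrdiff's actual code.

```shell
#!/bin/sh
# Hypothetical sketch: print the characters of the first string that
# differ from the second at the same position, and a space wherever
# the two strings agree.
show_diffs() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        out = ""
        for (i = 1; i <= length(a); i++) {
            ca = substr(a, i, 1)
            cb = substr(b, i, 1)
            out = out (ca == cb ? " " : ca)
        }
        print out
    }'
}

# The aligned strings from the second record of the output above.
ocr='macches che shelve5 nf$The Library af$tke Sublime.'
goal='matches the shelves of the Library of the Sublime.'
show_diffs "$ocr" "$goal"    # what the OCR predicted (record line 4)
show_diffs "$goal" "$ocr"    # what it should have been (record line 5)
```

Where the correct character is a space (as opposite each $ here), the second call prints a space, so that difference is only visible in the first line.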
For example, let's look at the second record in the above output (the
record of 7 lines starting with a line containing Line 2/ 2). The
first line after the header shows what the gocr OCR generated. The
last line in the record (just before the ------..) shows what should
have been generated. The second and second-last lines show the input
strings after special $ characters have been inserted to maximize the
amount of character-to-character correspondence. This process is
critical to the proper functioning of ocrdiff (discussed in more
detail in the ocrdiff help). The numbers to the left of these lines
indicate how many $ characters were inserted, and correspond
respectively to the number of characters the OCR missed (didn't
segment as glyphs) and the number of characters the OCR mistakenly
identified as being two characters.
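The relationship between the $ markers and the summary counts can be sketched as follows. This assumes a POSIX shell with awk, ASCII input, and two strings that have already been aligned; the function name count_errors is hypothetical, and the sketch does not attempt the alignment step itself (nor does it handle a literal $ in the text).

```shell
#!/bin/sh
# Hypothetical sketch (not ocrdiff's actual code): classify each
# differing position in a pair of aligned strings.
#   "$" in the OCR string  -> a character the OCR missed
#   "$" in the goal string -> a spurious character the OCR produced
#   anything else          -> a mismatch
count_errors() {
    awk -v ocr="$1" -v goal="$2" 'BEGIN {
        missing = spurious = mismatches = 0
        for (i = 1; i <= length(ocr); i++) {
            o = substr(ocr, i, 1)
            g = substr(goal, i, 1)
            if (o == g) continue
            if (o == "$") missing++
            else if (g == "$") spurious++
            else mismatches++
        }
        printf "missing=%d spurious=%d mismatches=%d\n", missing, spurious, mismatches
    }'
}

# Second record above: prints "missing=2 spurious=0 mismatches=7"
count_errors 'macches che shelve5 nf$The Library af$tke Sublime.' \
             'matches the shelves of the Library of the Sublime.'
```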
The middle two lines, as already mentioned, show only those characters
that differ, and analyzing these lines can give some useful insight
into what is going wrong. Note that the first mismatch is c vs. t. For
the given input image, these two characters look surprisingly similar,
so it isn't surprising that gocr is having difficulties with them. The
same is true of all of the other mismatches on the line. Note that the
existence of $ characters also indicates that gocr failed to identify
two glyphs in this line.
Having briefly explained what ocrdiff does, let's now use it to assess
the accuracy of the ocrad module. The -F flag specifies what file
prefix to use when generating the .od file(s).
prompt% ocrdiff -F rod-ocrad rod-ocrad.ocr rod.valid
which produces the following on stderr:
Writing to rod-ocrad.{od,od+,od-}
Results Summary:
OCR File : rod-ocrad.ocr ( 214 chars, 5 lines)
Goal File : rod.valid ( 217 chars, 4 lines)
Missing : 6
Spurious : 2
Mismatches: 84
All Errors: 92
Accuracy : 57.6% (1-(92/217))*100
From the above, we can see that the ocrad module gives much poorer
results than gocr for this particular input image (using the default
parameter values for each module). However, keep in mind that the
input image, and the parameters sent to each OCR, can have a huge
effect on overall accuracy.
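When comparing several modules on the same image, the two steps used above (run conjecture, then score with ocrdiff) can be wrapped in a loop. This is a sketch that uses only the commands and flags already shown in this discussion; the function name compare_modules is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run each named OCR module on one image and score
# the result against the canonical text.
# $1 is the image basename (e.g. rod, expecting rod.pnm and rod.valid in
# the current directory); the remaining arguments are module names.
compare_modules() {
    image=$1
    shift
    for module in "$@"; do
        # Skip scoring if the module failed to run (e.g. ocrad not built).
        conjecture -M "$module" -i "$image.pnm" -o "$image-$module.ocr" || continue
        ocrdiff -F "$image-$module" "$image-$module.ocr" "$image.valid"
    done
}
```

From the harness/db directory, compare_modules rod gocr ocrad would repeat the two comparisons above; as before, each ocrdiff summary appears on stderr.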
The preceding discussion provides a very brief overview of how to get
started with conjecture. However, there are many command-line flags
not yet discussed, and numerous other useful scripts provided by
Conjecture. Of these scripts, one in particular is very useful for
individuals wanting to gain a familiarity with the Conjecture
infrastructure and codebase. If you want to search for an arbitrary
string among all possible places it might reside, you can use ocrfind.
prompt% ocrfind -u
The above creates a database of files that ocrfind will use in
searches. It takes a while, but only needs to be generated once a day
(or so).
The command
prompt% ocrfind -h
gives more details on how to use this script. Keep it in mind,
because it can be very useful if you are planning on contributing to
Conjecture in any way.