The Testing Harness of Conjecture
Conjecture encourages users to contribute everything from tiny
modifications to existing methods of existing classes, to full-fledged
OCR Modules. However, modifications to the codebase can affect more
than is immediately apparent. Testing a change by using a single set
of command-line flags, for a single OCR module, on a single input
image, does not accurately assess the overall impact that the change
has on the Conjecture environment as a whole.
For this reason, Conjecture provides an assessment infrastructure with
the following features:
- An (arbitrarily large) collection of input images.
- A corresponding collection of text files representing what an
  OCR implementation would produce if it had 100% accuracy.
- The ability to test individual OCRs or groups of OCRs (or all OCRs),
  against individual images or groups of images (or all images).
- Every OCR Module has an associated set of configurable parameters,
  values that influence its behavior and can be set by the user.
  A particular combination of values for all configurable parameters
  represents a single module variant. The assessment infrastructure
  allows one to test a specific variant or a group of variants.
  (Note that the complete set of possible variants is effectively
  infinite, because some parameters can take on floating-point values.)
  For this reason, Conjecture provides a way of enumerating and
  naming variants that produce interesting results.
- The ability to establish the accuracy of each variant of each OCR.
- The ability to report the effect that a change in the code base has
  on the entire execution environment (increased or decreased
  accuracy, increased or decreased time-to-compute, etc.)
The $CONJECTURE/harness directory structure
$CONJECTURE/harness has three important components:
- The ../config/Conjecture.tests file, describing the modules and
  module variants to test. [wmh: this filename will be changing!]
- The db subdirectory
- The modules subdirectory
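Schematically (each component is described in the sections below):
$CONJECTURE/config/Conjecture.tests    which modules and variants to test
$CONJECTURE/harness/db/                input images and .valid reference files
$CONJECTURE/harness/modules/           per-module, per-variant test output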
The $CONJECTURE/config/Conjecture.tests file
This file describes the set of OCR Modules and module variants to test.
# this is the default module,
# used if -M is not specified
MODULE default
VARIANT v1;
MODULE gocr
# variant 1 of the gocr module
VARIANT v1;
MODULE ocrad # a comment, ignored
VARIANT v1;
VARIANT v2 FLAGS { -Z "-T .7" };
VARIANT v3 FLAGS { -Z "-T .75" };
In this file, anything after a '#' (including the '#') is removed as
a comment. The ocrgen tool ignores blank lines in this file. Except
for the above, there are only two legal line formats within this file
(any other line format is an error and is ignored by ocrgen):
- a line containing the keyword MODULE followed by a module name,
  <module>. The <module> must be a legal identifier (starting with a
  letter or underscore, and continuing with letters, underscores, or
  digits) and must correspond to the name of a Module defined in
  $CONJECTURE/config/Conjecture.modules .
  [wmh: the .modules and .tests files will probably be merged later]
- a line containing the keyword VARIANT, followed by a variant name,
  <variant>, followed optionally by the keyword FLAGS, an open brace,
  an arbitrary sequence of conjecture command-line flags, a close
  brace, and a semicolon. The variant name must be a legal identifier.
  If FLAGS is not provided, default values are used for all flags.
  If it is provided, the open/close braces, flags, and semicolon must
  all be present.
Note that the value associated with FLAGS does not represent the
entire set of flags sent to conjecture. In particular, the following
flags are implicitly provided by the test harness during each
invocation of conjecture that it generates:
-V 0
-M <module>
-i <input>.<suffix>
-o modules/<module>/<variant>/<input>.ocr
All of these flags are added after the flags from
the VARIANT (and thus override any attempts to
define them within FLAGS).
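For example, for variant v2 of the ocrad module shown above, each
generated invocation would look roughly like this (the input file
name 4x6.pnm is hypothetical):
% conjecture -Z "-T .7" -V 0 -M ocrad -i 4x6.pnm -o modules/ocrad/v2/4x6.ocr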
[wmh: In general, XML is the proper way to describe
input formats nowadays. However, I wanted this file to be
accessible to non-programmers, and didn't want to force them
to learn XML. By defining a very simple format, with only two
kinds of acceptable lines, it is hoped that
programming-phobic individuals will be able to edit and
experiment with Conjecture at this level. The format is
actually based on an XML-equivalent syntax (related to
research completely unrelated to OCRs).]
The harness/db directory
This directory contains an arbitrarily large collection of input
image files in various graphics formats. Each input image has a
corresponding .valid file providing the text
representation of the image file. The .valid file
represents the desired output that would be produced by an OCR
that has 100% accuracy.
One ramification of the above is that if two image files have the same
prefix, with different suffixes (.jpg vs .pnm, for example), both
images must refer to exactly the same textual content, since both
files will be described by the single associated .valid file.
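For example (hypothetical file names), harness/db might contain:
4x6.pnm       4x6.valid
ocr-a.jpg     ocr-a.pnm     ocr-a.valid
Here ocr-a.jpg and ocr-a.pnm must render exactly the same text, since
both are checked against the single file ocr-a.valid.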
The harness/modules directory
The modules directory contains a subdirectory for each
OCR module, <module>. Within each of these directories, a
further subdirectory exists for each variant, <variant>.
The test harness involves executing conjecture on many
input files using many command-line variants. The output from each
run is placed into
harness/modules/<module>/<variant>/<input>.ocr .
The variant subdirectories contain, in addition to
.ocr files, <input>.val files.
These files are similar to
harness/db/<input>.valid files, except that
rather than representing the perfect output, they represent
the "expected" result for the given module and variant. These
files are used to quickly determine whether a change in the
code base has had an effect (intentional or otherwise) on the
performance of a specific module variant.
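For the Conjecture.tests file shown earlier, and a hypothetical input
image 4x6.pnm, this directory would contain entries such as:
harness/modules/ocrad/v1/4x6.ocr
harness/modules/ocrad/v1/4x6.val
harness/modules/ocrad/v2/4x6.ocr
harness/modules/ocrad/v2/4x6.val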
The ocrtest Script
The ocrtest script is the interface to the Conjecture
test harness. It is usually invoked with one of these purposes in
mind:
- validity testing
- assessment testing
- comparison testing
- individual testing
Validity Testing
When changes to the codebase have been made, it is important to
know the overall impact these changes have had. Testing on a single
input file with a single combination of flags does not guarantee that
a change hasn't introduced unintended effects that detrimentally
(or beneficially) affect some other module and/or variant.
Using the command:
% ocrtest -a
runs the test harness over every module and variant specified
in $CONJECTURE/config/Conjecture.tests , for
every image file found in $CONJECTURE/harness/db ,
writing results into
harness/modules/<module>/<variant>/<input>.ocr . Each result is then
compared against
harness/modules/<module>/<variant>/<input>.val . A 2D table is then
generated, with module/variants as rows, and input files as
columns. Within the table, if the module/variant/input
produces an .ocr that matches .val
exactly, the table entry will contain a plus ('+') sign. If they
do not match, this is indicated by an asterisk ('*').
Several other characters are used to indicate other kinds
of errors ('@' if no .ocr file was generated, '%' if the
.val file is missing, '?' if the executable
exited abnormally).
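In summary, the documented symbols are:
+   output matches the expected .val file exactly
*   output does not match the expected .val file
@   no .ocr file was generated
%   the .val file is missing
?   the executable exited abnormally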
An example of the output produced by ocrtest
during validity testing is shown below:
============================================
Name 4x6 5x7 5x8 ocr-a ocr-b rod
--------------------------------------------
default
v1 + + + + + +
gocr
v1 + + + + + +
ocrad
v1 + + + + + +
v2 + + + + + +
v3 + + + + + +
--------------------------------------------
Table: Conjecture Test Harness
============================================
If the table contains only '+' signs, then the entire application
is working exactly as expected. However, if any '*' appears (or
any of a number of other symbols), it indicates that
unexpected output has occurred. Whether this is a good thing
or a bad thing depends on whether the difference represents an
increase in accuracy or not, which is established using
assessment testing, discussed next.
[wmh: document the other symbols that have been
added!]
The ocrtest script accepts a -x
argument that specifies which conjecture
executable to use. By default, it is conjecture ,
and the PATH environment variable is used to establish where
to find it. However,
$CONJECTURE/harness/Makefile explicitly
specifies a -x ./ocrprog flag in its invocations
of ocrtest . Various targets in
$CONJECTURE/src copy executables to
$CONJECTURE/harness/ocrprog when appropriate.
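For example, to run validity testing against a locally built
executable in the harness directory:
% ocrtest -a -x ./ocrprog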
Assessment Testing
The goal of an OCR is to produce accurate results. The assessment
capabilities of the Conjecture test harness allow one to see how every
module variant performs across all input files, by reporting the
accuracy (as a percentage correct relative to the expected output).
Assessment testing allows us to identify the best module variant for
each input file. As well, when changes to the code base have produced
changes in output (as indicated by validity testing discussed above),
assessment testing allows one to see whether the changes were an
improvement or a hindrance to overall accuracy.
Using the command:
% ocrtest -A
does everything that validity testing does, but in addition, it
executes the ocrdiff script to compare
harness/modules/<module>/<variant>/<input>.val
against db/<input>.valid . Remember that
db/<input>.valid represents the "goal" output that
an OCR with 100% accuracy would produce. The ocrdiff
script performs an intelligent character-by-character analysis to
establish the accuracy of the OCR output.
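The exact algorithm ocrdiff uses is not described here. As a rough
illustration only, accuracy metrics of this kind are often based on
edit distance; the following minimal Python sketch (not Conjecture
code) computes a percentage in that style. Note that such a formula
goes negative when the number of edits needed exceeds the length of
the reference text, which may be why negative entries such as -22
appear in the table below.
# Hypothetical accuracy metric based on Levenshtein edit distance.
# An illustrative sketch, NOT the actual ocrdiff algorithm.
def accuracy(ocr_text, valid_text):
    m, n = len(ocr_text), len(valid_text)
    prev = list(range(n + 1))              # distances for empty OCR prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ocr_text[i - 1] == valid_text[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # delete from ocr_text
                         cur[j - 1] + 1,       # insert into ocr_text
                         prev[j - 1] + cost)   # substitute (or match)
        prev = cur
    edits = prev[n]
    # 100 when no edits are needed; below 0 when the output needs
    # more corrections than the reference has characters.
    return round(100 * (1 - edits / max(n, 1)))

print(accuracy("he1lo world", "hello world"))   # prints 91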
The result of assessment testing is another table, like this:
============================================
Name 4x6 5x7 5x8 ocr-a ocr-b rod
--------------------------------------------
default
v1 87 97 98 88 100 80
gocr
v1 87 97 98 88 100 80
ocrad
v1 0 0 0 49 -22 57
v2 0 0 0 49 -22 54
v3 0 0 0 49 -22 57
--------------------------------------------
Table: Conjecture Test Harness
============================================
It reports the percentage accuracy for each module variant when
applied to every input image. It also reports the validity
information, except that anytime a '+' would have been shown in the
validity table, a space (' ') is shown (to avoid unnecessary clutter).
[wmh: update this table, explain why the -22 is occurring, etc.]
Comparison Testing
Comparison testing is very similar to assessment testing:
% ocrtest -A -x <newexec> -X <oldexec>
The only difference is that assessment results are computed for two
different executables. Individual tables are shown for each, and then
a "difference" table is presented, showing the accuracy of the first
minus the accuracy of the second. This gives a convenient means of
assessing at a glance the relative impact that a particular change in
the code base has had on overall performance. Positive entries in this
table indicate improvements in accuracy, while negative entries
indicate a worsening of accuracy.
Naturally, comparison testing assumes that you have two different
executables to compare. When you are experimenting with a new
algorithm, you should always make a copy of the "baseline" executable
(and any other incremental improvements along the way) so that you
will have them available for comparison testing.
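A hypothetical workflow (file names are illustrative):
% cd $CONJECTURE/harness
% cp ocrprog ocrprog.baseline     # preserve the current baseline
  (modify the code, rebuild, and reinstall ./ocrprog)
% ocrtest -A -x ./ocrprog -X ./ocrprog.baseline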