The Conjecture Testing Infrastructure

Overview

Conjecture encourages users to contribute everything from tiny modifications to existing methods of existing classes, up to full-fledged OCR Modules. However, modifications to the codebase can affect more than is immediately apparent. Testing a change with a single set of command-line flags, for a single OCR module, on a single input image, does not accurately assess the overall impact the change has on the Conjecture environment as a whole.

For this reason, Conjecture provides an assessment infrastructure, described in the sections that follow.

The $CONJECTUREROOT/harness directory structure

$CONJECTUREROOT/harness has three important components:

Conjecture.tests

This file describes the set of OCR Modules and module variants to test.

      # this is the default module, used if -A is not specified
      MODULE default
        VARIANT v1;

      MODULE gocr
        # variant 1 of the gocr module
        VARIANT v1;

      MODULE ocrad
        VARIANT v1;
        VARIANT v2 FLAGS { -Z "-T .7" };
        VARIANT v3 FLAGS { -Z "-T .75" };  # this is a comment and will be removed

There are two legal formats for lines in this file: a MODULE line, which introduces the OCR module to be tested, and a VARIANT line, which names a variant of the most recently introduced module and may optionally supply additional command-line flags in a FLAGS { ... } clause.

One additional comment about the format: as the example shows, a '#' introduces a comment that extends to the end of the line and is stripped before the line is interpreted.

Note that the flags specified for each variant do not represent the entire set of flags sent to conjecture. In particular, the following flags are implicitly provided by the test harness to each invocation:

       -V 0 -A <algset> -i <input>.<suffix> -o custom/<algset>/<variant>.ocr

All of these flags occur after the flags from the variant (and thus override any attempts to define them within the variants).

[wmh: In general, XML is the proper way to describe input formats nowadays. However, I wanted this file to be accessible to non-programmers, and didn't want to force them to learn XML. By defining a very simple format, with only two kinds of acceptable lines, it is hoped that programming-phobic individuals will be able to edit and experiment with Conjecture at this level. Yes, the format can be made more readable. Yes, I'll work on it sometime :-)]
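As a concrete illustration, the grammar implied by the example above can be parsed in a few lines of Python. This is only a sketch inferred from the example file (the real harness parser is not shown in this documentation), and the function name parse_tests is invented for illustration:

```python
import re

def parse_tests(text):
    """Parse the two legal line formats of Conjecture.tests:
         MODULE <name>
         VARIANT <name> [FLAGS { <flags> }];
    Sketch only -- inferred from the example; '#' starts a comment.
    """
    modules = {}                    # module name -> [(variant, flags)]
    current = None
    for raw in text.splitlines():
        line = raw.split('#', 1)[0].strip()     # strip comments
        if not line:
            continue                            # skip blank lines
        m = re.match(r'MODULE\s+(\S+)$', line)
        if m:
            current = m.group(1)
            modules[current] = []
            continue
        m = re.match(r'VARIANT\s+(\S+?)(?:\s+FLAGS\s*\{(.*)\})?\s*;$', line)
        if m and current is not None:
            modules[current].append((m.group(1), (m.group(2) or '').strip()))
            continue
        raise ValueError('unrecognized line: %r' % raw)
    return modules
```

Feeding it the ocrad stanza from the example yields one entry per variant, with the FLAGS text attached to v2 and v3 and an empty flag string for v1.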

The harness/db directory

This directory contains an arbitrarily large collection of input image files in various graphics formats. Each input image has a corresponding .valid file providing the text representation of the image file. The .valid file represents the desired output that would be produced by an OCR with 100% accuracy.

One ramification of the above is that if two image files share the same prefix but have different suffixes (.jpg vs .pnm, for example), both images must contain exactly the same textual content, since both files will be described by the single associated .valid file.
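That pairing rule can be sketched as follows. group_by_prefix is a hypothetical helper, not part of the harness, that groups the files of harness/db by prefix and reports any image prefix lacking a .valid companion:

```python
import os
from collections import defaultdict

def group_by_prefix(filenames):
    """Group harness/db image files by prefix (illustrative sketch).

    Files such as 'rod.jpg' and 'rod.pnm' share the single validation
    file 'rod.valid', so they must contain the same text.
    """
    groups = defaultdict(list)      # prefix -> image files
    valid = set()                   # prefixes that have a .valid file
    for name in filenames:
        prefix, suffix = os.path.splitext(name)
        if suffix == '.valid':
            valid.add(prefix)
        else:
            groups[prefix].append(name)
    # every image prefix should have a matching .valid file
    missing = [p for p in groups if p not in valid]
    return dict(groups), missing
```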

The harness/custom directory

The custom directory contains a subdirectory for each OCR module, <module>. Within each of these directories, a subdirectory exists for each variant, <variant>.

The test harness involves executing conjecture on many input files using many command-line variants. The output from each run is placed into harness/custom/<module>/<variant>/<input>.ocr.

The variant subdirectories contain, in addition to .ocr files from individual executions of the ocr, <input>.val files. These files are similar to harness/db/<input>.valid files, except that rather than representing the perfect output, they represent the "expected" result for the given module and variant. These files are used in order to quickly determine whether a change in the code base has had an effect (intentional or otherwise) on the performance of a specific module variant.
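The quick check described above amounts to a byte-for-byte comparison of the fresh .ocr output against the expected .val file. A minimal sketch, assuming the file layout given (validity_mark is an invented name, and the '?' marker for a missing file is hypothetical; the real ocrtest script uses additional symbols):

```python
def validity_mark(ocr_path, val_path):
    """Return '+' if the fresh .ocr output matches the expected .val
    file byte-for-byte, and '*' otherwise.  Simplified sketch: the
    real ocrtest script reports further symbols; '?' for a missing
    file is invented here for illustration.
    """
    try:
        with open(ocr_path, 'rb') as f:
            ocr = f.read()
        with open(val_path, 'rb') as f:
            val = f.read()
    except OSError:
        return '?'
    return '+' if ocr == val else '*'
```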


The ocrtest script

The ocrtest script is the interface to the Conjecture test harness. It is usually invoked with one of these purposes in mind:

Validity Testing

When changes to the codebase have been made, it is important to know the overall impact those changes have had. Testing on a single input file with a single combination of flags does not guarantee that a change doesn't introduce unintentional effects that detrimentally (or positively) affect some other module and/or variant.

Using the command:

   % ocrtest -a

runs the test harness over every module and variant specified in $CONJECTUREROOT/config/Conjecture.tests, for every image file found in $CONJECTUREROOT/harness/db, writing results into harness/custom/<module>/<variant>/<input>.ocr. Each result is then compared against harness/custom/<module>/<variant>/<input>.val. If the files match identically, this is indicated with a '+' sign; if they do not match, with a '*' sign. A table is produced summarizing the results. An example of such output is:

     ============================================
     Name      4x6   5x7   5x8 ocr-a ocr-b   rod 
     --------------------------------------------
     default                                     
       v1        +     +     +     +     +     + 
     gocr                                        
       v1        +     +     +     +     +     + 
     ocrad                                       
       v1        +     +     +     +     +     + 
       v2        +     +     +     +     +     + 
       v3        +     +     +     +     +     + 
     --------------------------------------------
     Table: Conjecture Test Harness
     ============================================

If the table contains all '+' entries, the entire application is working exactly as expected. However, if any '*' appears (or any of a number of other symbols), unexpected output has occurred. Whether this is a good thing or a bad thing depends on whether the difference represents an increase in accuracy, which is established using assessment testing, discussed next.

[wmh: document the other symbols that have been added!]

The ocrtest script accepts a -x argument that specifies which conjecture executable to use. By default, it is conjecture, and the PATH environment variable is used to establish where to find it. However, $CONJECTUREROOT/harness/Makefile explicitly passes a -x ./ocrprog flag to its invocations of ocrtest. Various targets in $CONJECTUREROOT/src copy executables to $CONJECTUREROOT/harness/ocrprog when appropriate.

Assessment Testing

The goal of an OCR is to produce accurate results. The assessment capabilities of the Conjecture test harness allow one to see how every module variant performs across all input files, by reporting the accuracy (as a percentage correct relative to the expected output).

Assessment testing allows us to identify the best module variant for each input file. In addition, when changes to the code base have produced changes in output (as indicated by validity testing, discussed above), assessment testing shows whether those changes were an improvement or a hindrance to overall accuracy.

Using the command:

   % ocrtest -A

does everything that validity testing does, but in addition it executes the ocrdiff script to compare harness/custom/<module>/<variant>/<input>.val against db/<input>.valid. Remember that db/<input>.valid represents the "goal" output that an OCR with 100% accuracy would produce. The ocrdiff script performs an intelligent character-by-character analysis to establish the accuracy of the OCR output.
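The actual algorithm used by ocrdiff is not documented here, but a simplified stand-in illustrates how a character-level alignment can yield a percentage score, and how that score can even go negative when errors outnumber matches (as the -22 entries in the table below suggest). The function name and the exact scoring rule are assumptions, not ocrdiff's implementation:

```python
import difflib

def accuracy_percent(ocr_text, valid_text):
    """Score OCR output against the .valid reference (sketch only).

    Aligns the two strings character by character and counts matched
    characters minus errors (deletions plus insertions), as a
    percentage of the reference length.  This is NOT ocrdiff's actual
    algorithm -- just an illustration of a metric that reaches 100 for
    a perfect match and can go negative when errors outnumber matches.
    """
    matcher = difflib.SequenceMatcher(None, valid_text, ocr_text)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    errors = (len(valid_text) - matched) + (len(ocr_text) - matched)
    if not valid_text:
        return 100.0
    return round(100.0 * (matched - errors) / len(valid_text), 1)
```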

The result of assessment testing is another table, like this:

============================================
Name      4x6   5x7   5x8 ocr-a ocr-b   rod 
--------------------------------------------
default                                     
  v1      87    97    98    88   100    80  
gocr                                        
  v1      87    97    98    88   100    80  
ocrad                                       
  v1       0     0     0    49   -22    57  
  v2       0     0     0    49   -22    54  
  v3       0     0     0    49   -22    57  
--------------------------------------------
Table: Conjecture Test Harness
============================================

It reports the percentage accuracy for each module variant when applied to every input image. It also reports the validity information, except that wherever a '+' would have been shown in the validity table, a space (' ') is shown instead (to avoid unnecessary clutter).

[wmh: update this table, explain why the -22 is occurring, etc.]

Comparison Testing

Comparison testing is very similar to assessment testing:

   % ocrtest -A -x <newexec> -X <oldexec>

The only difference is that assessment results are computed for two different executables. Individual tables are shown for each, and then a "difference" table is presented showing the accuracy of the first minus the accuracy of the second. This gives a convenient means of assessing at a glance the relative impact that a particular change in the code base has had on overall performance. Positive entries in this table indicate improvements in accuracy, while negative entries indicate a worsening of accuracy.

Naturally, comparison testing assumes that you have two different executables to compare. When you are experimenting with a new algorithm, you should always make a copy of the "baseline" executable (and any other incremental improvements along the way) so that you will have them available for comparison testing.
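The difference table amounts to an entry-wise subtraction of the two assessment tables. A trivial sketch, assuming each table is keyed by (module, variant, image) -- difference_table is an invented name:

```python
def difference_table(new, old):
    """Entry-wise difference of two assessment tables (sketch).

    Each table maps (module, variant, image) -> accuracy percentage,
    as produced by two runs with different executables.  Positive
    entries mean the new executable improved accuracy.
    """
    return {key: new[key] - old[key] for key in new if key in old}
```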


Generated on Mon Jun 12 20:27:16 2006 for Conjecture by  doxygen 1.4.6