TODO
A collection of things to do, divided into various
categories. Some of these things will be trivial for you to help
address - please contribute where you can.
Easy Issues (Low-hanging fruit)
Input Images
------------
- Convert $CONJECTURE/harness/db/*.pnm to max intensity 255.
The 4x6.pnm, etc. files have max intensity 65535 and ocrad
fails on them (poor dumb ocrad!). We really don't need
65535 anyway, so it isn't a problem to convert them.
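The conversion itself is a linear rescale of each sample. Below is a minimal sketch, assuming the image has already been decoded into a vector of samples; the function names are hypothetical and not part of Conjecture, and reading or writing the .pnm header is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Rescale one sample from [0, old_max] to [0, new_max], rounding to the
// nearest integer. PNM allows a maxval of at most 65535, so the
// intermediate product always fits in 32 bits.
inline std::uint16_t rescale_sample(std::uint32_t value,
                                    std::uint32_t old_max,
                                    std::uint32_t new_max = 255)
{
    assert(old_max > 0 && value <= old_max);
    return static_cast<std::uint16_t>(
        (value * new_max + old_max / 2) / old_max);
}

// Rescale every sample of an already-decoded image in place.
inline void rescale_samples(std::vector<std::uint16_t>& samples,
                            std::uint32_t old_max,
                            std::uint32_t new_max = 255)
{
    for (auto& s : samples)
        s = rescale_sample(s, old_max, new_max);
}
```

If Netpbm is installed, its pamdepth tool (pnmdepth in older releases) performs the same maxval change from the command line, which may be simpler than writing new code.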
HTML Pages
----------
- The current web page structure is using TABLE for structuring. We
want to switch over to using DIV to make things more flexible.
If you have experience with CSS, contact bruno and wade so we can
collaborate on a redesign.
- Related to the above, we are currently using BLOCKQUOTE for
indentation within sections. Using CSS will be much better.
We need a CSS-knowledgeable contributor!
- Figure out why indentation of paragraphs after paragraphs isn't
working properly. I know it is trivial in CSS, but am obviously
doing something wrong. See the P P entry in style.css and tell
me what to fix.
- A new format for describing existing Components is needed. This
is the old-style format that needs to be made more readable.
- A new format for the TODO page (the page you are reading right now)
must be designed. However, it should be very easy for people to add
new entries, so requiring a complex HTML structure for each entry
is not desirable. Some combination of PHP and CSS might make it
look better and still make it easy for people to specify content
without having to typeset.
- How do we stop HTML from pushing the Quick Links sidebar further
left when the page contains PRE or IMG elements that cannot wrap?
Once the Quick Links is a DIV element, some sort of
relative-to-right margin positioning request? But we want text to
flow around it. I have some handy embedded-figure CSS code I'll dig
out; maybe that will work nicely.
High-level Documentation
------------------------
- If you notice typos or other kinds of errors in any documentation,
please fix them! New users are the ones most likely to detect such
errors, and also most in need of clear, concise, accurate
explanations, so please help maintain the documentation. High level
documentation is in $CONJECTURE/doc/web . Class and
method-level documentation is in the header files in
$CONJECTURE/src/*
- Write a tutorial for new Modules, with a simple example that can
be used as a guide for new developers.
Class Documentation
-------------------
- We need to insert a header in every source file. Presumably SVN has
substitution variables like CVS does? Who knows what variables to
use???
Compiler Warnings
-----------------
- [wade] On cygwin, use of 'std::list<...>::push_back()' generates
compiler warnings of the following form:
/bin/../lib/gcc/i686-pc-cygwin/3.4.4/include/c++/bits/stl_list.h:435:
warning: '__p' might be used uninitialized in this function
Anyone see the reason? Line 435 of my cygwin platform's stl_list.h is
_Node* __p = this->_M_get_node();
which doesn't seem to be a candidate for such a warning (if anything,
the field returned by '_M_get_node()' should be reported as
potentially uninitialized, since '__p' itself cannot be uninitialized
unless the return of '_M_get_node' is uninitialized).
Google reports a few others having this problem, but no solutions as far
as I could tell.
Design Issues
Components
----------
- Many more components must be identified and formalized!
- Currently we only have the "broad strokes".
ProcessComponent provides total control over the OCR
implementation. SegmentComponent does Page-to-Glyph
segmentation (and optionally intermediate segmentation).
IdentifyComponent does Glyph-to-unicode translation.
FormatComponent does text placement (words, lines,
columns, etc.).
ProcessComponent is almost always implemented by executing a
SegmentComponent, then an IdentifyComponent, then a
FormatComponent. That is, components can be composed of a number of
sub-components (but don't have to be!).
The next critical design issue wrt Components is to subdivide each
of Segment, Identify and Format into sub-components. For example,
there should be a DeskewComponent, NoiseRemovalComponent, and
numerous other components within one or more of these "high-level"
components.
Identifying which issues should be formalized into Components is
not always easy. For example, should thresholding and pixel
filtering be separate components, or should they be part of a
single component? Part of the fun of Conjecture is exploring the
best design!
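The composition described above (a ProcessComponent that runs a SegmentComponent, then an IdentifyComponent, then a FormatComponent) can be sketched as follows. This is only an illustration under stated assumptions: the method names, the Page and Glyph placeholder types, PipelineProcess, and the Toy* classes are all hypothetical and do not reflect the actual Conjecture headers.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the real Conjecture data types.
struct Page  { std::string pixels; };  // toy "image": one char per glyph
struct Glyph { char shape; };

// The three broad-stroke stages, each an abstract interface.
class SegmentComponent {
public:
    virtual ~SegmentComponent() = default;
    virtual std::vector<Glyph> segment(const Page& page) = 0;
};

class IdentifyComponent {
public:
    virtual ~IdentifyComponent() = default;
    virtual char32_t identify(const Glyph& glyph) = 0;
};

class FormatComponent {
public:
    virtual ~FormatComponent() = default;
    virtual std::u32string format(const std::vector<char32_t>& codes) = 0;
};

// ProcessComponent has total control over the OCR implementation.
class ProcessComponent {
public:
    virtual ~ProcessComponent() = default;
    virtual std::u32string process(const Page& page) = 0;
};

// The common case: a process built by chaining the three sub-components.
class PipelineProcess : public ProcessComponent {
public:
    PipelineProcess(std::unique_ptr<SegmentComponent> s,
                    std::unique_ptr<IdentifyComponent> i,
                    std::unique_ptr<FormatComponent> f)
        : segment_(std::move(s)), identify_(std::move(i)),
          format_(std::move(f))
    {
        assert(segment_ && identify_ && format_);
    }

    std::u32string process(const Page& page) override {
        std::vector<char32_t> codes;
        for (const Glyph& g : segment_->segment(page))
            codes.push_back(identify_->identify(g));
        return format_->format(codes);
    }

private:
    std::unique_ptr<SegmentComponent> segment_;
    std::unique_ptr<IdentifyComponent> identify_;
    std::unique_ptr<FormatComponent> format_;
};

// Toy implementations, present only so the sketch can be exercised.
struct ToySegment : SegmentComponent {
    std::vector<Glyph> segment(const Page& page) override {
        std::vector<Glyph> glyphs;
        for (char c : page.pixels)
            if (c != ' ') glyphs.push_back(Glyph{c});  // skip "whitespace"
        return glyphs;
    }
};

struct ToyIdentify : IdentifyComponent {
    char32_t identify(const Glyph& g) override {
        return static_cast<char32_t>(static_cast<unsigned char>(g.shape));
    }
};

struct ToyFormat : FormatComponent {
    std::u32string format(const std::vector<char32_t>& codes) override {
        return std::u32string(codes.begin(), codes.end());
    }
};
```

A finer-grained stage such as a DeskewComponent could be slotted in the same way, either as a fourth member of the pipeline or as a sub-component owned by one of the three stages; which is better is exactly the open design question.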
Specialized Modules
-------------------
- How do we go about representing specialized modules (those
modules designed to only work for hand-written text, or only
for numbers, or only for dot-matrix input, etc.)?
- One possibility is to define subclasses of Module representing
these specializations. But is this the best solution? Or should the
individual Component hierarchies contain specialized
abstract subclasses to deal with this issue?
Things To Do Related to Implementation
Exceptions
----------
- Must decide what to do about exceptions. Currently just throwing
'string' instances, but a much cleaner solution involving a
hierarchy of exception classes is needed eventually. Don't want to
require Boost until it becomes the standard.
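One Boost-free possibility is a small hierarchy rooted in the standard library's std::runtime_error. The sketch below is only a suggestion: the namespace, the class names, the error categories, and the run_and_report helper are all hypothetical.

```cpp
#include <stdexcept>
#include <string>

namespace conjecture_sketch {  // hypothetical namespace

// Root of the hierarchy: catch this to handle any Conjecture failure.
class Error : public std::runtime_error {
public:
    explicit Error(const std::string& message)
        : std::runtime_error(message) {}
};

// Problems reading or decoding an input image (hypothetical category).
class ImageError : public Error {
public:
    explicit ImageError(const std::string& message)
        : Error("image: " + message) {}
};

// Problems raised inside a module or component (hypothetical category).
class ModuleError : public Error {
public:
    ModuleError(const std::string& module, const std::string& message)
        : Error("module " + module + ": " + message) {}
};

// Runs an action and turns any Conjecture error into a report string.
// Errors of other types propagate to the caller unchanged.
inline std::string run_and_report(void (*action)()) {
    try {
        action();
        return "ok";
    } catch (const Error& e) {
        return e.what();
    }
}

}  // namespace conjecture_sketch
```

Because every class derives from std::runtime_error, code that already catches std::exception keeps working, and existing 'string' throws can be migrated one call site at a time.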
Naming Issues
-------------
- Find better names for certain classes:
- The 'Region' class is a candidate for renaming.
- The 'Env' class is a candidate for renaming.
GOCR Documentation
------------------
- The Gocr Module provides an interface to the GOCR codebase.
However, although gocr can provide good results, its codebase is
very poorly documented.
- Provide an explanation of what each command-line flag does
- Provide an overview discussion of the flow-of-control in
gocr and the important methods and files.
- Provide a detailed discussion of the critical methods
like ocr0_eE(), etc.
- All of this documentation should be placed in the
src/modules/gocr directory (either in source files or other
discussion files).
Things To Do Related to Infrastructure
Name
----
- Determine whether we are staying with the name Conjecture, or
whether we will be changing the name to something else.
- During initial development bruno and wade explored various names,
the favored being Apocryphal and Sorcerer. However, due to the
negative (if amusing) connotations of Apocryphal, and existing
image analysis software named Sorcerer, we decided to go with
Conjecture as our "temporary" name.
- A name that has the letters 'o', 'c', and 'r' in it seems
appropriate. The following one-liner:
% perl -ne 'next if substr($_,0,6) eq substr($last,0,6);
if ( m/(.*)o(.*)c(.*)r(.*)/ ) {
chomp;
printf("%-20s = %s\n", $_, "$1O$2C$3R$4");
$last = $_;
}' /usr/share/dict/words
produces a candidate set of possible names. Editing it
based on double entendres wrt OCR or amusement value gives the
following possibilities:
apocryphal = apOCRyphal
aristocracy = aristOCRacy
autoincrement = autOinCRement
binocular = binOCulaR
boxcar = bOxCaR
cockroach = cOCkRoach
concentrate = cOnCentRate
concord = cOnCoRd
concur = cOnCuR
confectionery = cOnfeCtioneRy
conjecture = cOnjeCtuRe
consecrate = cOnseCRate
constructor = cOnstruCtoR
democracy = demOCRacy
doctor = dOCtoR
hypocrisy = hypOCRisy
idiosyncrasies = idiOsynCRasies
jockstrap = jOCkstRap
mediocre = mediOCRe
molecular = mOleCulaR
nomenclature = nOmenClatuRe
obfuscatory = ObfusCatoRy
obscure = ObsCuRe
octahedra = OCtahedRa
omicron = OmiCRon
orchard = OrChaRd
orchestra = OrChestRa
popcorn = pOpCoRn
postscript = pOstsCRipt
processor = prOCessoR
rhinoceros = rhinOCeRos
scorcher = scOrCheR
scorecard = scOreCaRd
soccer = sOcCeR
Socrates = SOCRates
sorcerer = sOrCereR
Velociraptor = VelOCiraptoR
woodpecker = woOdpeCkeR
Web Pages
---------
- The web pages always need more work. If you are reading
documentation for the first time, and something isn't making sense,
make a note of it! When it starts making sense, go back and edit
the documentation to make things more clear!
Installation
------------
- Set up a proper installation environment
- autoconf (assigned to Bruno)
- maintain the $CONJECTURE/external/Makefile ,
which should automatically download/compile/install third-party
source code useful to Conjecture.
Platform Support
----------------
- Verify that code compiles and works on common platforms
- I have been testing on linux and cygwin machines
- The stupid case-insensitive file directory convention
on Windows has posed problems on more than one occasion.
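One way to catch the case-insensitivity problem before it reaches a Windows checkout is to look for paths that differ only by letter case. The following is a rough sketch with hypothetical function names; it assumes the list of repository paths has been gathered elsewhere and that ASCII case folding is sufficient.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <set>
#include <string>
#include <vector>

// Lower-case the ASCII letters of a path (no locale or Unicode handling).
inline std::string to_lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Returns each group of distinct paths that would collide on a
// case-insensitive file system. Exact duplicates are not collisions.
inline std::vector<std::vector<std::string>>
case_collisions(const std::vector<std::string>& paths) {
    std::map<std::string, std::set<std::string>> groups;
    for (const auto& p : paths)
        groups[to_lower_ascii(p)].insert(p);

    std::vector<std::vector<std::string>> result;
    for (const auto& entry : groups)
        if (entry.second.size() > 1)
            result.emplace_back(entry.second.begin(), entry.second.end());
    return result;
}
```

For example, a tree containing both Makefile and makefile yields one group with those two names, while a tree with no such pairs yields an empty result.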
Hosting
-------
- Decide when to move the codebase from
the two temporary mirrors
http://www.holst.ca/conjecture
and
http://www.corollarium.com/conjecture
to someplace more formal.
- Do we want to go to sourceforge? Bruno says sourceforge has
SVN - that's good!
Test Harness
------------
- The test harness is meant to be an (ever expanding) collection
of images and expected output, and an infrastructure for testing
the output of 'conjecture' against that expected output.
- Extend the test harness
- Obtain new images and add them to the harness/db directory
(for now, let's keep them relatively small, and in .pnm
format).
- Create 'harness/db/<name>.valid' file by typing in the
text that a 100% accurate OCR would produce. If you can automate
this process, please let us know :-) :-) :-) hee hee.
- Create harness/<module>/<variant>/<name>.val using
ocrtest -u .
- Modify harness/Makefile (or generalize the makefile to look
for all .pnm files!)
- Since the image database can end up growing arbitrarily large, we
will need to sub-divide it into self-contained "image bundles".
Some bundles can be specific to a certain kind of image
(handwritten, large font, numbers only, etc.), while other bundles
will contain many different kinds of images.
- Not all of these bundles will be placed in the SVN
repository. Instead, a "code" bundle will be in the repository,
and additional bundles can be downloaded from the website as
desired. Makefile and script support for dealing with
alternative databases can be added (i.e. add a -D flag to
ocrtest to specify which non-standard image bundle
to use during testing).
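The harness tests the output of 'conjecture' against the expected text in a .valid file. How ocrtest scores that comparison is not described here, so the following is an assumption: one simple option is character accuracy derived from the Levenshtein edit distance. The function names are hypothetical, and the comparison is byte-wise rather than Unicode-aware.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance: the minimum number of single-character
// insertions, deletions, and substitutions that turn a into b.
inline std::size_t edit_distance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j)
        prev[j] = j;  // distance from the empty prefix of a
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({subst, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Fraction of the expected text recovered, clamped to [0, 1].
inline double char_accuracy(const std::string& expected,
                            const std::string& actual) {
    if (expected.empty())
        return actual.empty() ? 1.0 : 0.0;
    double errors = static_cast<double>(edit_distance(expected, actual));
    double accuracy = 1.0 - errors / static_cast<double>(expected.size());
    return std::max(0.0, accuracy);
}
```

A score like this could be reported per image and per bundle, which would make it easier to see whether a change helps handwritten input while hurting large-font input, for instance.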
Feature Wishlist
This is a place to jot down ideas for desired features.
However, once they have been explored more thoroughly, they will
be moved into the Design/Implementation/Infrastructure sections.