An Extensible Optical Character Recognition Framework

Guidelines For Modifying and Extending The Conjecture Framework

[wmh: these are preliminary and subject to change as the Conjecture community grows - currently just my own personal set of coding conventions - please forgive the soap-box :-)]

Readability, code-reuse, abstraction, modularity, encapsulation and information hiding aren't just buzzwords, they are critical to smooth, ongoing evolution of code. And they don't occur by magic. Here are some guidelines to follow when working on the Conjecture framework.

  • Every class is documented. Every method is documented. Every field is documented. Use '///' or '/** ...*/' so that the class/method/field documentation is available to doxygen. The UML diagrams are to be kept up-to-date. Overall changes in design are kept up-to-date.
  • Every class should have a static 'Test(int argc, char* argv[])' method that acts as a unit test. This method should create one or more instances of the class (and any other classes needed to get the class to work) and test the public interface methods, checking return results/output against expected results to verify that it is working as expected. The 'main' program provides special support for invoking these unit-testing methods instead of performing normal OCR processing, allowing convenient per-class testing.
  • All fields are defined in private scope and are preceeded by an underscore. Accessors are defined for each field. See the more detailed discussion on 'Accessors' for more.
    • getters can be public, but should ALWAYS have read-only semantics, and should be moved out of 'public' scope if the field isn't critical to the public interface.
    • setters should never be at public scope. Use 'friend's when classes need access to the non-public functionality of other classes.
  • Classes should be fully initialized by their constructors. Partial initialization requires class clients to remember to perform additional initialization before the class is safe to be used, and completely defeats the point of having an initializer to begin with! Furthermore, partial-initialization constructors require the introduction of public access to internal state (directly or indirectly via public read-write accessors). Note that this mandate implies that classes should NOT have default constructors unless the default values assigned to fields are not only intuitive and meaningful, but also useable as-is.
  • Avoid symbol conflicts.
    • All classes are placed in the Conjecture namespace.
    • No global variables.
    • Do NOT place 'using namespace std' statements in header files. Yes, I know it allows you to avoid putting 'std::' in front of all sorts of types, but putting such a statement in header files forces third-party code to fully-qualify class names that conflict with those in 'std'.
    • The use of 'using namespace std' in source files is completely ok.
  • Avoid assuming there will only be one instances of any particular class. Support for multi-threading/parallel/distributed algorithms suggests that even classes that seem like Singletons may not be.
  • Write code in a thread-safe manner. Avoid non-const static variables where possible.
  • Document important changes in the $CONJECTURE/ChangeLog file. Especially important to document is the introduction, changing or removal of public interface methods in any kernel or utils class. Concise notes on *everything* are probably better than detailed notes on a few things. Detailed notes can be placed in appropriate system/class/method documentation.
  • All methods should be virtual by default. It is unlikely that the minor gains in efficiency from having a non-virtual method will significantly improve performance. The exception to this rule is setters and getters, which can incur a significant efficiency hit if made virtual because they can then no longer be inlined. Other exceptions include methods that are expected to be executed millions of times.
  • If a method is virtual, explicitly specify the 'virtual' keyword in the class header even when doing so is redundant. This convention allows everyone to establish whether a method is virtual or not by looking in a single header file, instead of having to search every header file up the entire inheritance path when virtualness is in doubt. Although some programming environments make this convention unnecessary, program for all environments, not just your own.
  • The "appearance" of code can have a significant impact on encouraging or discouraging others to contribute. If the code is "open" and "accessible", it is more likely people will be motivated to modify and extend. If the code is obscure, cluttered, or undocumented, it is much less likely that people will contribute. Make efforts to make your code esthetically pleasant. Document classes and methods, of course. Within each method, document each conceptual "block" of code. One line of natural-language description can make 10 lines of code much more accessible!


Suppose we have a field called 'name', of type 'Type' (for example, 'Type' might be 'int' or 'A*'). One way to define it would be:

   class A {
       Type name;

And then an individual would read/write the field using:

   A* a = new A;
   a->name = 10;
   cout << "name = " << a->name << endl;

A public field is a terrible implementation decision because:

  1. All code, everywhere, is allowed to modify (and thus potentially corrupt) its value. There is no "safety" available and modification-at-a-distance is difficult to debug.
  2. If, during the evolution of the code, the type of the field (or the existence of the field) comes into question, changes are both difficult, and guaranteed to break any third-party code that relies on this field.

However, using a simple strategy can address both of these problems and add some additional capability. And in C++, this comes at absolutely no runtime performance cost, as we'll see below.

Instead of defining our field as shown above, we do the following:

   class A {
        inline const Type& name() const { return this->_name; }

        inline void nameIs(Type val) { this->_name = val; }
        inline Type& nameRef() { return this->_name; }

       Type _name;

In the above, we have moved the field definition itself into private scope, and places an underscore in front of it. We have also added three tiny inlined methods providing get/set access to the field.

Features worth noting:

  • The 'name()' method returns a const version of the underlying field. It allows 'read-only' access to the field, but does not allow it to be modified. It is often at public scope, but should be placed in protected scope if the field is not part of the public interface. Obviously, the fewer field accessors in the public interface, the more separation between implementation and interface.
  • The 'nameIs()' method sets the value of the field to its single argument. This method should almost never be placed at public scope. Classes should be designed so that they, are a small set of tightly-coupled friend classes, are responsible for modifying themselves. In this way, corruption of state is much easier to identify, because only a small number of classes can be responsible, and ultimately the change had to have occured via this setter method.
  • The 'nameRef()' method has exactly the same code block as 'name()', but its return value and non-const status make it much more powerful. It too should almost never be placed at public scope, as it provides a second means of modifying the field. For example, if Type is 'int', we can write:
       a->nameRef() = 10;
    because 'nameRef()' returns a reference to the 'int' field. This accessor is useful because anything we could do with the underlying field 'a->_name', we can also do with 'a->nameRef()'. For example:
       cout << a->_name << endl;
       a->_name += 10;
    is the same as
       cout << a->nameRef() << endl;
       a->nameRef() += 10;
    except that using 'nameRef' provides for some separation of interface and implementation. Note that since 'nameRef' is inline, the above two sets of lines really are identical after compilation.

Additional Features

  • The type of the argument in 'nameIs()' can sometimes be made 'const', but when it is a pointer type, it cannot be made 'const' unless the underlying field itself is 'const'.
  • Although many C++ programmers avoid using 'const', constness can have a significant positive impact on code quality. Rather than solving 'const' issues by not using 'const', it takes only a little effort to understand the basic principles. Remember that:
    • If a variable is 'const', it cannot be modified. That is, if it appears on the right-hand side of an assignment statement, the compiler will generate an error.
    • 'const'ness is contagious - if one variable is 'const', it often forces other variables or methods to also be 'const', which may in turn force other variables or methods to be 'const', etc. etc. Because of this, the strategy of "I'll get my code working without 'const' first, then add 'const' semantics in later" is a terrible idea, because it is difficult to "add in 'const'" incrementally - making one variable/method 'const' requires many others to be made const before the code will compile again. Because of this, it is much better to always make EVERYTHING (variables and methods) 'const' by default, at least conceptually, then remove 'const' when it is established that the variable/method in question cannot be 'const' and still retain the desired semantics.
    • If a variable is 'const', then only 'const' methods can be invoked with that variable as a receiver. For example:
         const Person p;
      Only if 'somefunc' is marked as 'const' in its definition:
         void somefunc() const { ... }
      will the above 'p.somefunc();' invocation be allowed by the compiler.
    • A non-static method f() defined within a class A has a special variable called 'this'. Since it is a variable, it must have a type. The type of 'this', within class A, is ALWAYS either 'A* const' or 'const A*const'. Which one it is depends on whether the method f() is 'const' or not.
      • A method is 'const' if the keyword 'const' appears AFTER its parameter list.
      • If 'this' is of type 'A* const', it means that the space for the pointer itself cannot be changed, so the statement
           this = a;
        is not allowed. Since the 'this' variable is ALWAYS 'A* const' or 'const A*const', one is never allowed to modify 'this'. However, a variable being '*const' does not affect the ability to modify what the variable points to, so
           this->some_field = 10;
        isn't disallowed.
      • If 'this' is of type 'const A* const', it means that everything above for 'A* const' holds, AND that we cannot modify what the variable points to, so
           this->some_field = 10;
        becomes a compile-time error.

Quick Links

  Downloads : V-0.06
  Howto : Install
  Community : Mailing List
  To Do : Questions
Design Implementation Infrastructure

Conjecture is using services provided by SourceForge