I'm not quite sure what terminology to search for, so my title is funky... Here is the workflow I've got:
Semi-structured documents are scanned to file. The files are OCR'd to text.
The text is parsed into Python objects.
The objects are serialized (to SQL, JSON, whatever) for use.
The documents are structured like this:
HEADER blah blah, Page ###
blah
Garbage text...
1. Question Text...
continued until now. A. Choice text...
adsadsf. B. Another Choice...
2. Another Question...
I need to extract the questions and choices. The problem is that, because the text is OCR output, there are occasional strange substitutions (like 'Z' in place of '2') which make ordinary regular expressions useless. I've tried the Levenshtein module and it helps, but it requires prior knowledge of what edit distance to expect.
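For reference, roughly what I tried with the Levenshtein module looks like this (the expected prefix format and the distance threshold are things I have to guess up front, which is exactly the part that bothers me):

    import Levenshtein

    def looks_like_question(line, next_number, max_distance=1):
        # Compare the start of the line against the expected "N. " prefix and
        # accept it if the edit distance is small enough.  Choosing
        # max_distance is where the prior knowledge comes in.
        expected = "%d. " % next_number
        prefix = line[:len(expected)]
        return Levenshtein.distance(prefix, expected) <= max_distance

    print(looks_like_question("Z. Another Question...", 2))  # True ('Z' read for '2')
    print(looks_like_question("Garbage text...", 2))         # False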
I don't know whether I'm looking to create a parser, a lexer, or something else entirely. This has led me down all kinds of interesting but irrelevant paths. Guidance would be greatly appreciated. Oh, also, the text is generally from specific technical domains, so general spelling tools are not so helpful.
Regarding the structure of the documents, there is no clear visual pattern -- like line breaks or indentation -- except that "questions" usually begin a line. Crap on the document can cause stray characters to appear before the actual beginning of the line, which means that something along the lines of r'^[0-9]+' does not reliably work.
Though the "questions" always begin with an integer, a period, and a space, the OCR can substitute other characters or skip characters entirely. This is not so much a problem with Tesseract or CuneiForm as with the poor quality of the paper documents.
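The best workaround I've come up with so far (and I'm not convinced it's the right approach) is to normalize the most common digit look-alikes at the start of each line and then match a deliberately loose pattern that tolerates a little leading junk. The substitution table, the number of leading characters normalized, and the amount of junk allowed are all guesses from my own documents:

    import re

    # Digit look-alikes I keep seeing in the OCR output; this table is a
    # guess and certainly not exhaustive.
    OCR_DIGIT_FIXES = {'Z': '2', 'z': '2', 'O': '0', 'o': '0',
                       'l': '1', 'I': '1', '|': '1', 'S': '5', 'B': '8'}

    # Allow up to a few junk characters before the number (specks on the
    # page) and allow ',' or nothing where the '.' should be.
    QUESTION_RE = re.compile(r'^.{0,3}?([0-9]{1,3})\s*[.,]?\s+(\S.*)$')

    def normalize_prefix(line, length=4):
        # Only touch the first few characters, where the question number
        # should be, so the question text itself is left alone.
        head = ''.join(OCR_DIGIT_FIXES.get(c, c) for c in line[:length])
        return head + line[length:]

    def match_question(line):
        m = QUESTION_RE.match(normalize_prefix(line))
        return (int(m.group(1)), m.group(2)) if m else None

    print(match_question("Z. Another Question..."))  # -> (2, 'Another Question...')
    print(match_question("Garbage text..."))         # -> None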
Note: for the project in question, it was decided that having a human prep the OCR'd text was better than spending the time coding a solution. I'd still love good pointers, however.