How to preprocess text to do OCR error correction
Posted by eaglefarm on Stack Overflow
Published on 2010-04-28T01:18:00Z
Here is what I'm trying to accomplish: I need to get several large text files off a computer that is not networked and has no output device other than a printer. I tried printing the text and then scanning the printout with OCR to recover the text on another computer, but the OCR makes lots of errors (1 vs l, o vs 0, O vs D, etc).
To solve this, I am thinking of writing a program that processes (annotates?) the text file before printing it, so that the errors can be corrected from the text output of the OCR program. For example, to handle 1 (number one) vs l (letter L), I could take text like this:
sample
and insert \nnn after each character that is frequently wrong in the OCR results:
sampl\108e
Then I can write another program that scans the OCR output for \nnn, checks the character before it (where nnn is that character's ASCII code in decimal), and fixes it if necessary. Of course, the program will have to recognize that the \nnn may contain errors too, but at least it knows that nnn must be digits, so it can easily correct them.
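To make the idea concrete, here is a minimal Python sketch of both programs. It is only an illustration of the scheme described above, not a tested solution: the set of confusable characters, the digit look-alike map, and the names annotate and restore are all assumptions made for the example. The backslash is annotated as well, so that a literal backslash in the source text is never mistaken for a marker. The sketch also assumes the OCR reads the backslash itself correctly, which may not hold in practice.

```python
import re

# Hypothetical set of characters that OCR tends to confuse; tune it for the
# actual font and OCR engine. The backslash is included so that a literal
# backslash in the source can never be mistaken for a marker.
CONFUSABLE = set("1lI0OoD\\")

# Hypothetical map from letters that OCR may produce in place of digits.
DIGIT_FIXES = str.maketrans(
    {"l": "1", "I": "1", "|": "1", "O": "0", "o": "0", "D": "0"}
)

# One character, a backslash, then three digits or digit look-alikes.
MARKER = re.compile(r"(.)\\([0-9lI|OoD]{3})")


def annotate(text):
    """Append \\nnn (zero-padded decimal ASCII code) after each confusable character."""
    out = []
    for ch in text:
        out.append(ch)
        if ch in CONFUSABLE:
            out.append("\\%03d" % ord(ch))
    return "".join(out)


def restore(ocr_text):
    """Replace each annotated character with the one its code names and drop the marker."""
    def fix(match):
        # The code field is known to be digits, so map look-alikes back first.
        code = int(match.group(2).translate(DIGIT_FIXES))
        return chr(code)
    return MARKER.sub(fix, ocr_text)


print(annotate("sample"))       # sampl\108e
print(restore("samp1\\lO8e"))   # sample
```

In the second call the OCR has misread both the letter (l as 1) and the code (108 as lO8). The code is repaired first because it can only contain digits, and the repaired code then overrides whatever character the OCR produced in front of it.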
I think I would add a CRC on each line so that any line that isn't corrected perfectly can be flagged as having a problem.
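The per-line CRC could be sketched in Python roughly as follows. This assumes ASCII text and uses the standard zlib.crc32 function. The line layout (a space followed by a fixed-width 10-digit decimal field at the end of the line), the look-alike map, and the names add_crc and check_crc are choices made for the example only. The CRC is written in decimal rather than hex so that, like the nnn codes, the field is known to contain only digits.

```python
import re
import zlib

# Hypothetical map from letters that OCR may produce in place of digits.
DIGIT_FIXES = str.maketrans(
    {"l": "1", "I": "1", "|": "1", "O": "0", "o": "0", "D": "0"}
)


def add_crc(line):
    """Append the CRC-32 of the line as a space plus a 10-digit decimal field."""
    crc = zlib.crc32(line.encode("ascii"))  # assumes the source text is ASCII
    return "%s %010d" % (line, crc)


def check_crc(tagged_line):
    """Split off the CRC field and report whether it matches the remaining text."""
    text, field = tagged_line[:-11], tagged_line[-10:]
    # The field must be digits, so map OCR look-alikes back before comparing.
    field = field.translate(DIGIT_FIXES)
    if not re.fullmatch(r"[0-9]{10}", field):
        return text, False
    actual = zlib.crc32(text.encode("ascii", "replace"))
    return text, int(field) == actual


tagged = add_crc("sample")
print(check_crc(tagged))                  # ('sample', True)
print(check_crc("samp1e" + tagged[6:]))   # ('samp1e', False)
```

The CRC would be computed on the original line before annotating and printing, and checked after the \nnn markers have been used to correct the OCR output, so a failed check flags exactly the lines that still need manual attention.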
Has anyone done anything like this? If there is an existing way of doing this, I'd rather not reinvent the wheel. Suggestions for an annotation format that would help solve this problem would be welcome too.