OCR: How to improve accuracy - existing libraries for removing non-text 'furniture', shapes, etc to

Posted by Rob on Stack Overflow
Published on 2010-03-15T15:08:02Z Indexed on 2010/03/15 15:09 UTC

I want to remove rectangles etc that enclose text in a screenshot image, so that I can perform optical character recognition to get accurate text from the screenshot.

Background:

I am doing this to extract data from a legacy application for use with other applications. This is the only way to get at the data, because the associated files are in a closed, proprietary, binary format.

I will be using AutoItScript to drive the application to show data in its UI, then I will screenshot this and feed this to tesseract.

I've already had some success in automating the UI, and have been able to use tesseract to get plain ASCII text out of the bitmap.
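For reference, the OCR step of this pipeline can be sketched as below. This is only a minimal illustration, assuming a `tesseract` executable on the PATH that accepts the usual `tesseract <image> <output base> -l <lang>` form and writes its result to `<output base>.txt`. The function names and file names are hypothetical and not part of any existing tool.

```python
import subprocess


def ocr_command(image_path, output_base, lang="eng"):
    """Build the tesseract command line for one screenshot.

    tesseract writes the recognised text to <output_base>.txt.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run tesseract on one image and return the recognised text.

    Assumes the tesseract executable is installed and on the PATH.
    """
    subprocess.run(ocr_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

The AutoIt-driven screenshot would be saved to a file first, then something like `run_ocr("screenshot.tif", "screenshot")` would return the text.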

There are several AutoItScript forum articles discussing its use with tesseract/OCR, but none that specifically address my question. http://www.autoitscript.com/forum/index.php?s=6c32c3ece12756e635a619cdf175eff9&showforum=2

What I need to do

There are thin, 1-pixel-wide rectangles that closely enclose some of the text. When the image is fed to tesseract, it reads the lines as characters; for example, a vertical edge of a rectangle comes out as an I.

Any thoughts on how to remove the rectangles, or best practices?

I'm asking whether there is a generic command-line toolset that can overwrite rectangles in, for example, .png files. I could then pass each .png through it before passing it to tesseract.
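To make the idea concrete, here is a rough sketch of one possible approach, not an existing tool: treat any long, unbroken straight run of dark pixels as part of a rectangle border and paint it over with the background. It assumes the screenshot has already been thresholded into a grid of 0 (background) and 1 (dark) values. Loading a .png into that form would need an imaging library and is not shown. The function name and the `min_len` parameter are hypothetical.

```python
def erase_long_runs(pixels, min_len):
    """Return a copy of `pixels` with long straight dark runs erased.

    `pixels` is a list of rows, each a list of 0 (background) or 1 (dark).
    Any horizontal or vertical run of at least `min_len` dark pixels is
    assumed to be a rectangle edge and is set to background.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    erase = [[False] * width for _ in range(height)]

    # Mark long horizontal runs, row by row.
    for y in range(height):
        x = 0
        while x < width:
            if pixels[y][x]:
                start = x
                while x < width and pixels[y][x]:
                    x += 1
                if x - start >= min_len:
                    for i in range(start, x):
                        erase[y][i] = True
            else:
                x += 1

    # Mark long vertical runs, column by column.
    for x in range(width):
        y = 0
        while y < height:
            if pixels[y][x]:
                start = y
                while y < height and pixels[y][x]:
                    y += 1
                if y - start >= min_len:
                    for i in range(start, y):
                        erase[i][x] = True
            else:
                y += 1

    # Runs are measured on the original grid, so corners shared by a
    # horizontal and a vertical edge are erased along with both edges.
    return [
        [0 if erase[y][x] else pixels[y][x] for x in range(width)]
        for y in range(height)
    ]


def boxed_example():
    """Build a tiny 6x8 grid: a 1-pixel border around a 2-pixel-tall stroke."""
    grid = [[0] * 8 for _ in range(6)]
    for x in range(8):
        grid[0][x] = grid[5][x] = 1
    for y in range(6):
        grid[y][0] = grid[y][7] = 1
    grid[2][3] = grid[3][3] = 1
    return grid


if __name__ == "__main__":
    for row in erase_long_runs(boxed_example(), min_len=5):
        print("".join("#" if value else "." for value in row))
```

The threshold matters: `min_len` has to be larger than the tallest character stroke, otherwise letters such as I or l would be erased as well. Whether a safe threshold exists depends on how closely the rectangles hug the text.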

Details on the tesseract release/setup I've used are as follows:

Go here: http://code.google.com/p/tesseract-ocr/downloads/list. For the basic English generic character set, which gets Tesseract up and running and recognising bitmapped text as ASCII text, use tesseract-2.00.eng.tar.gz (the current version at the time of writing is listed as "English language data for Tesseract (2.00 and up) Jul 2007 989 KB 84845").

Related questions I have already looked at on Stack Overflow

In these, my question is not completely answered or a commercial solution is being sold. I do not want to consider a commercial solution at this stage.

© Stack Overflow or respective owner