PDF Text Extraction Approach Using OCR

Posted by Jon on Stack Overflow
Published on 2009-04-22T16:38:31Z Indexed on 2010/03/26 23:33 UTC

Has anybody attempted to extract text from a PDF using an OCR library and Java? Which library did you find to be the most reliable for text extraction? Most of the options I've seen (Tesseract, GOCR) are C libraries that would require some JNI code to be written.

I'm familiar with PDFBox, which is now an Apache Incubator project at version 0.8.x, but its text extraction isn't always accurate. I'm looking for an alternative approach that is somewhat more reliable.

I haven't tried Asprise JavaPDF yet (I'm in the process of trying it), but I wanted to know more about the OCR approach, if it's possible.
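To make the OCR route concrete, here is a minimal sketch of one way to sidestep JNI: run the `tesseract` command-line tool as a child process from Java and read back the text file it writes. This is an illustration, not a tested solution. It assumes `tesseract` is installed and on the PATH, and that each PDF page has already been rendered to an image by some other tool (the file name `page1.png` is hypothetical). The class and method names (`TesseractCli`, `buildCommand`, `ocr`) are made up for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: calls the tesseract CLI instead of binding to it via JNI.
class TesseractCli {

    // Builds the command line: tesseract <image> <outputBase> -l <lang>
    // Tesseract appends ".txt" to <outputBase> when it writes its result.
    static List<String> buildCommand(Path image, Path outputBase, String lang) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image.toString());
        cmd.add(outputBase.toString());
        cmd.add("-l");
        cmd.add(lang);
        return cmd;
    }

    // Runs tesseract on a single page image and returns the recognized text.
    // Assumption: the page was rasterized to an image beforehand.
    static String ocr(Path image, String lang) throws IOException, InterruptedException {
        Path workDir = Files.createTempDirectory("ocr");
        Path outputBase = workDir.resolve("page");

        Process process = new ProcessBuilder(buildCommand(image, outputBase, lang))
                .inheritIO() // pass tesseract's console output through so the child cannot block on a full pipe
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }

        Path textFile = workDir.resolve("page.txt");
        return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Usage: java TesseractCli page1.png
        System.out.println(ocr(Paths.get(args[0]), "eng"));
    }
}
```

Spawning a process per page is slower than an in-process binding, but it keeps the Java side free of native code. Accuracy would still depend on the resolution and quality of the rendered page images.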

Any help would be appreciated.

© Stack Overflow or respective owner
