PDF Text Extraction Approach Using OCR
Posted by Jon on Stack Overflow
Published on 2009-04-22T16:38:31Z
Indexed on 2010/03/26 23:33 UTC
Has anybody attempted to extract text from a PDF using an OCR library from Java? Which library did you find to be the most reliable for text extraction? Most of the OCR engines I've seen (Tesseract, GOCR) are native C/C++ libraries that would require writing some JNI code.
I'm familiar with PDFBox, which is now an Apache Incubator project at version 0.8.x, but its text extraction isn't always accurate. I'm looking for an alternative approach that is somewhat more reliable.
I haven't tried Asprise JavaPDF yet (I'm in the process of trying it), but I wanted to know more about the OCR approach, if it's possible.
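To make the OCR idea concrete: one way to avoid writing JNI might be to call the tesseract command-line program as an external process and read its output. The following is only a rough sketch under several assumptions that are not part of the question: the tesseract binary is installed and on the PATH, each PDF page has already been rendered to an image file by some other tool, and a Java 11+ runtime is available. The class and method names (TesseractCli, buildCommand, ocr) are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: runs the tesseract CLI instead of binding to it via JNI.
class TesseractCli {

    // Builds the command line "tesseract <image> stdout -l <lang>".
    // Passing "stdout" as the output base makes tesseract print the text
    // to standard output instead of writing a .txt file.
    static List<String> buildCommand(Path image, String lang) {
        return List.of("tesseract", image.toString(), "stdout", "-l", lang);
    }

    // Runs tesseract on a single page image and returns the recognized text.
    // Assumes the tesseract executable can be found on the PATH.
    static String ocr(Path image, String lang) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, lang))
                // Discard diagnostics so an unread stderr pipe cannot block the child.
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text;
    }
}
```

With a rendered page saved as page-1.png (a made-up file name), a call such as TesseractCli.ocr(Path.of("page-1.png"), "eng") would return whatever text tesseract recognizes on that page. Spawning one process per page is slower than an in-process binding, so this trades speed for not having to maintain native glue code.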
Any help would be appreciated.
© Stack Overflow or respective owner