How to know if a PDF contains only images or has been OCR scanned for searching?

Posted by Bratch on Stack Overflow
Published on 2009-09-28T22:45:42Z Indexed on 2010/04/22 18:13 UTC


I have a bunch of PDF files that came from scanned documents. The files contain a mix of images and text. Some were scanned as images with no OCR, so each PDF page is one large image, even where the whole page is entirely text. Others were scanned with OCR and contain images and searchable text where text is present. In many cases even words in the images were made searchable.

I want to automate running OCR on all of the scanned documents with Acrobat 8 Pro, but I don't want to re-OCR files that have already been through OCR. Does anyone know of a way to tell which files contain only images and which already contain searchable text?

I'm planning on doing this in C# or VB.NET, but I don't think telling the two kinds of files apart is language dependent.
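
To make the question concrete, here is a rough sketch of one possible check, written in Python only because the idea is not language dependent. It rests on an assumption rather than a guarantee: a page that carries searchable text normally references a font resource, so a file with no `/Font` entry anywhere is probably image-only. The function names `looks_searchable` and `needs_ocr` are made up for this illustration. The sketch uses only the standard library, inflates Flate-compressed streams because newer PDFs can pack their dictionaries into compressed object streams, and does not handle other stream filters or encrypted files.

```python
import re
import zlib

# Matches the body of each "stream ... endstream" section. The lookbehind
# keeps the pattern from starting inside the word "endstream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def looks_searchable(pdf_bytes):
    """Heuristic: True if the PDF appears to reference any font resource.

    Assumption: OCR output is stored as (usually invisible) text, which
    needs a font, while an image-only scan has no /Font entry at all.
    """
    # Dictionaries stored uncompressed can be searched directly.
    if b"/Font" in pdf_bytes:
        return True

    # Otherwise try to inflate each stream and search the result.
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates trailing bytes such as the end-of-line
            # marker that precedes "endstream".
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example a JPEG image); skip it
        if b"/Font" in data:
            return True
    return False


def needs_ocr(path):
    """Return True if the file at `path` looks like an image-only PDF."""
    with open(path, "rb") as handle:
        return not looks_searchable(handle.read())
```

Calling `needs_ocr` on each file in a folder would give the list to send through Acrobat. The check can be wrong in both directions: a scan with a typed header or stamp has a font but may still need OCR, and an encrypted file may hide its fonts, so a full PDF library that extracts the text of each page would be a more reliable test.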

© Stack Overflow or respective owner
