How to know if a PDF contains only images or has been OCR scanned for searching?
Posted by Bratch on Stack Overflow
Published on 2009-09-28T22:45:42Z
I have a bunch of PDF files that came from scanned documents. The files contain a mix of images and text. Some were scanned as images with no OCR, so each PDF page is one large image, even where the whole page is entirely text. Others were scanned with OCR and contain images and searchable text where text is present. In many cases even words in the images were made searchable.
I want to build an automated process that runs OCR on all of the scanned documents with Acrobat 8 Pro, but I don't want to re-OCR files that have already been through OCR. Does anyone know of a way to tell which files contain only images and which already contain searchable text?
I'm planning to do this in C# or VB.NET, but I don't think telling the two kinds of files apart is language-dependent.
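To make the distinction concrete, here is a rough sketch in Python of one way the check might work. It relies on an assumption rather than anything confirmed for these particular files: that an OCR'd PDF stores its recognized text as ordinary (often invisible) text operators in the page content streams, while an image-only PDF has none. The function names `has_text_layer` and `pdf_has_text_layer` are made up for this illustration. The sketch uses only the standard library, does not handle encrypted files or filters other than Flate, and is a heuristic, not a real PDF parser.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of each PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A string or array operand followed by a text-showing operator (Tj or TJ).
TEXT_SHOW_RE = re.compile(rb"[)\]>]\s*T[jJ]\b")
# A font selection such as "/F1 12 Tf".
FONT_SELECT_RE = re.compile(rb"/[^\s/\[\]()<>]+\s+-?[\d.]+\s+Tf\b")
# Marks a stream dictionary that describes an image XObject.
IMAGE_DICT_RE = re.compile(rb"/Subtype\s*/Image")


def has_text_layer(data: bytes) -> bool:
    """Heuristically report whether PDF bytes contain text-drawing operators."""
    for match in STREAM_RE.finditer(data):
        # Look at the object dictionary just before the stream keyword.
        header = data[max(0, match.start() - 2048):match.start()]
        obj_pos = header.rfind(b"obj")
        if obj_pos != -1:
            header = header[obj_pos:]
        if IMAGE_DICT_RE.search(header):
            continue  # pixel data, not page content

        raw = match.group(1)
        try:
            # Most content streams are Flate-compressed.
            content = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Assume the stream is stored uncompressed (or uses another filter).
            content = raw

        if FONT_SELECT_RE.search(content) and TEXT_SHOW_RE.search(content):
            return True
    return False


def pdf_has_text_layer(path: str) -> bool:
    """Read a PDF file from disk and apply the heuristic above."""
    with open(path, "rb") as handle:
        return has_text_layer(handle.read())
```

Files for which `pdf_has_text_layer` returns `False` would be the candidates to send through OCR. The same idea should carry over to C# or VB.NET, either by porting the byte scan or by asking a PDF library to extract each page's text and checking whether anything comes back. A document that mixes OCR'd and image-only pages would need a per-page check instead of this whole-file one.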
© Stack Overflow or respective owner