Batch OCR for many PDF files (not already OCRed)?

Posted by David on Super User
Published on 2010-02-11T19:30:53Z

Filed under: pdf | ocr | desktop-search

Hello,

I use Google Desktop Search (on Vista), and not all the PDF files in my archive folder are indexed. This is expected, since "PDF files that contain scanned images" are not indexed (http://desktop.google.com/support/bin/answer.py?hl=en&answer=90651).

So I would like to OCR the many PDF files of mine that are not already OCRed. My goal: I give the program a folder, and it searches the subfolders on its own for the PDF files that need to be converted into OCRed PDF files.
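The "search the subfolders for PDFs that still need OCR" step could be sketched roughly as follows. This is only a sketch under assumptions not stated in the question: it assumes Python is available and that the external `pdftotext` command-line tool (from Poppler/Xpdf) is installed and on the PATH. The function names and the `min_chars` threshold are made up for illustration. A PDF with almost no extractable text is treated as a probable scan; the OCR step itself is left to whichever tool ends up being chosen.

```python
import subprocess
import sys
from pathlib import Path


def extract_text(pdf: Path) -> str:
    """Return the text layer of a PDF via the external `pdftotext` tool (assumed installed)."""
    result = subprocess.run(
        ["pdftotext", str(pdf), "-"],  # "-" sends the extracted text to stdout
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=False,  # a failing file simply yields little or no text
    )
    return result.stdout


def needs_ocr(pdf: Path, extractor=extract_text, min_chars: int = 20) -> bool:
    """Heuristic (an assumption): almost no extractable text means a scanned PDF."""
    return len(extractor(pdf).strip()) < min_chars


def find_pdfs_needing_ocr(root: Path, extractor=extract_text):
    """Walk root and all its subfolders, yielding PDFs that appear to lack a text layer."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() == ".pdf" and needs_ocr(path, extractor):
            yield path


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_unocred.py <archive-folder>
    for pdf in find_pdfs_needing_ocr(Path(sys.argv[1])):
        print(pdf)
```

Note that a password-protected PDF will probably also show up in the list, because no text can be extracted from it until the password is removed.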

Note: in the past, when a PDF file was password-protected, I removed the password with another (paid) batch tool: verypdf.com "pwdremover".

Any (not too expensive) ideas?

I have already tried: FineReader 6 Pro, on XP at the time, but there was no batch processor included. Paperfile (paperfile.net), which uses Tesseract (code.google.com/p/tesseract-ocr/), but the OCR is only PDF to text, not PDF to PDF! There is also another project: code.google.com/p/ocropus

Thanks in advance ;)

