Stripping Non-Text from a Scanned, OCR'd PDF

Posted by Daniel S. on Super User
Published on 2011-12-19T05:19:09Z Indexed on 2012/04/09 11:33 UTC

Filed under: pdf | ocr

I have a PDF created from a scanned document, with OCR used to recognize the text. In Acrobat, if I select text and click 'copy with formatting', I can paste the formatted text into Word. So it seems that fonts and colors (and possibly sizes) are embedded in the document in addition to the plain text.

Is there any way to use this information to create a PDF that contains just the formatted OCR'd text, without the scanned image? Currently, my document shows only the scanned image, and the text sits on an invisible layer. I would like to create a PDF that removes the scanned image and displays the formatted text that is currently hidden.

The following post has a section on "How can we make the invisible text visible?": "PDF has an extra blank in all words after running through Ghostscript".

However, doing this does not reproduce the text formatting that is retained when pasting into Word. I would also like to remove the scanned image, so that the final PDF contains only formatted (color, font, size) vector text and no images.
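
To make the two operations concrete, here is a rough sketch of what "making the invisible text visible" and "removing the scanned image" could look like at the content-stream level. It assumes the OCR layer is hidden with text rendering mode 3 (the `3 Tr` operator, which is a common way to hide OCR text, though not confirmed for this file) and that the scan is painted by a `Do` call on a named image XObject. The function name, the `Im0` image name, and the sample stream are hypothetical. The code works on an already decoded content stream with plain regular expressions, so it ignores PDF tokenization (a literal string containing "3 Tr" would be altered too), and it does nothing to restore the fonts or colors that Acrobat recovers when copying.

```python
import re

# Text rendering mode operator with mode 3 (neither fill nor stroke, i.e. invisible).
INVISIBLE_TR = re.compile(rb"(?<![\w.])3\s+Tr\b")
# XObject paint operation such as "/Im0 Do"; group 1 is the resource name.
PAINT_XOBJECT = re.compile(rb"/([^\s/\[\]<>(){}%]+)\s+Do\b")


def reveal_text_and_drop_images(content: bytes, image_names: set) -> bytes:
    """Return a copy of a decoded content stream in which mode-3 text is
    switched to fill mode and the named image XObjects are no longer painted.

    image_names holds resource names (bytes, without the leading slash) that
    the caller has already identified as images; this is an assumption of the
    sketch, since Do also paints form XObjects.
    """
    # Switch invisible text (mode 3) to ordinary filled text (mode 0).
    content = INVISIBLE_TR.sub(b"0 Tr", content)

    def drop_if_image(match):
        # Remove the paint call only for names listed as images.
        return b"" if match.group(1) in image_names else match.group(0)

    return PAINT_XOBJECT.sub(drop_if_image, content)


# Hypothetical page content: a full-page scan followed by hidden OCR text.
sample = b"q 612 0 0 792 0 0 cm /Im0 Do Q\nBT 3 Tr /F1 10 Tf 72 700 Td (Hello) Tj ET\n"
print(reveal_text_and_drop_images(sample, {b"Im0"}).decode("latin-1"))
```

In a real file the content stream would first have to be extracted and decompressed with a PDF library, and the image XObject itself would still be present in the page resources until it is removed there as well.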

© Super User or respective owner
