Stripping Non-Text from a Scanned, OCR'd PDF
Posted by Daniel S. on Super User
Published on 2011-12-19T05:19:09Z
Indexed on 2012/04/09 11:33 UTC
I have a PDF created from a scanned document, and OCR was used to recognize the text. In Acrobat, if I select text and click 'Copy with Formatting', I can paste the formatted text into Word. So it seems that, in addition to the plain text, the document also stores fonts and colors, and possibly sizes.
Is there any way to use this information to create a PDF that contains only the formatted OCR'd text, without the scanned image? Currently, my document shows only the scanned image, and the text sits on an invisible layer. I would like to create a PDF that removes the scanned image and displays the formatted text that is currently hidden.
The following post has a section titled "How can we make the invisible text visible?": "PDF has an extra blank in all words after running through Ghostscript".
However, doing this does not reproduce the correct text formatting (which is retained when pasting into Word). I would also like to remove the scanned image, so that the final PDF contains only formatted text (color, font, size) set in vector fonts, and no images.
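To make the goal concrete, here is a minimal sketch of the kind of content-stream change I have in mind. It rests on several assumptions that I have not verified for this file: that the page content stream has already been decompressed into raw bytes, that the OCR layer is hidden with text rendering mode 3 (`3 Tr`), and that the scan is painted with an XObject operator such as `/Im0 Do`. The function name and the sample stream are hypothetical, and the regular expressions do not parse PDF syntax properly (they would also touch matching bytes inside string literals, and they drop form XObjects as well as images).

```python
import re

# Assumption: the hidden OCR layer uses text rendering mode 3 ("3 Tr").
# The lookbehind avoids matching the tail of a longer number such as "13 Tr".
INVISIBLE_TR = re.compile(rb"(?<![\w.+-])3(\s+)Tr(?![A-Za-z])")

# Assumption: the scanned page is painted by an XObject call like "/Im0 Do".
# Note that this also matches form XObjects, not only images.
PAINT_XOBJECT = re.compile(rb"/[^\s/\[\]()<>{}%]+\s+Do(?![A-Za-z])")


def reveal_text_and_drop_images(stream: bytes) -> bytes:
    """Return a copy of a decoded content stream with text made visible
    (mode 3 changed to mode 0, fill) and XObject paint operators removed."""
    stream = INVISIBLE_TR.sub(rb"0\1Tr", stream)
    stream = PAINT_XOBJECT.sub(b"", stream)
    return stream


# Hypothetical, hand-written content stream for illustration only.
sample = (
    b"q 612 0 0 792 0 0 cm /Im0 Do Q\n"
    b"BT 3 Tr /F1 12 Tf 72 700 Td (Hello) Tj ET\n"
)

print(reveal_text_and_drop_images(sample).decode("ascii"))
```

This only flips the rendering mode and stops the image from being painted. It does not add any formatting that is missing from the text layer, and the image object itself would still have to be removed from the page resources to shrink the file.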
© Super User or respective owner