Layout analysis of text-based PDF without OCR

Posted by fastrack on Super User
Published on 2013-07-03T04:30:15Z Indexed on 2013/07/03 5:08 UTC

Before recognizing a PDF, OCR software does document layout analysis to determine which parts are text, tables or images, as shown in the picture below.

![papercrop](http://cache.gawkerassets.com/assets/images/17/2011/07/papercrop.jpg)

I want to use some parts of the text while leaving out others, so software that marks those zones comes in handy. Papercrop does a decent job, but it has a bug where some of the text in the PDF file is not shown. OCR software can also do layout analysis, marking out "zones" that I can add or delete, but you have to run OCR to get them. Since my PDFs are already text-based, I don't want to waste so much time on OCR.

So my question is: is there any software that automatically marks out those zones and lets me manipulate them manually, without having to OCR?
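
To illustrate the kind of zoning meant here: a text-based PDF already stores a position for each piece of text, so zones can in principle be computed without OCR. The rough sketch below assumes the bounding boxes of the text lines have already been extracted by some PDF library (that step is not shown). `Box`, `merge`, `group_into_zones` and the `max_gap` threshold are hypothetical names chosen for this example, and the grouping rule is a simple heuristic, not what Papercrop or any OCR package actually does.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Bounding box of one text line, in PDF points (origin at bottom-left)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""


def merge(a: Box, b: Box) -> Box:
    """Return the smallest box covering both a and b, joining their text."""
    return Box(min(a.x0, b.x0), min(a.y0, b.y0),
               max(a.x1, b.x1), max(a.y1, b.y1),
               (a.text + "\n" + b.text).strip())


def group_into_zones(lines: List[Box], max_gap: float = 12.0) -> List[Box]:
    """Group text lines into zones (assumed heuristic).

    A line joins an existing zone when it overlaps that zone horizontally
    and the vertical gap below the zone is at most max_gap points;
    otherwise it starts a new zone.
    """
    zones: List[Box] = []
    # Read top to bottom: PDF y grows upward, so sort by descending top edge.
    for line in sorted(lines, key=lambda b: -b.y1):
        for i, zone in enumerate(zones):
            overlaps = line.x0 < zone.x1 and zone.x0 < line.x1
            gap = zone.y0 - line.y1
            if overlaps and gap <= max_gap:
                zones[i] = merge(zone, line)
                break
        else:
            zones.append(line)
    return zones
```

With zones held in a plain list like this, leaving out a zone is just a matter of removing its entry before extracting the text of the rest.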

Thanks! Waiting for your help.

© Super User or respective owner