Layout analysis of text-based PDF without OCR
- by fastrack
Before recognizing a PDF, OCR software performs document layout analysis to determine
which parts are text, tables, or images, as shown in the picture below.
![papercrop](http://cache.gawkerassets.com/assets/images/17/2011/07/papercrop.jpg)
I want to use some parts of the text while leaving out the others, so software
that marks those zones comes in handy. Papercrop does a decent job, but it has a
bug: it does not show some of the text in the PDF file. OCR software can also do
layout analysis, marking out "zones" that I can add or delete, but you have to
run OCR to do that. Since my PDFs are already text-based, I don't want to waste
so much time on OCR.
So my question is: is there any software that automatically marks out those zones
and lets me manually manipulate them, without having to OCR?
Thanks! Waiting for your help.