Preserve "long" spaces in PDFBox text extraction
Posted by Thilo on Stack Overflow
Published on 2011-01-11T10:47:44Z
Indexed on 2011/01/11 10:54 UTC
I am using PDFBox to extract text from a PDF. The PDF has a simple tabular structure, and the columns are very widely spaced from each other.
This works really well, except that every kind of horizontal space gets converted into a single space character, so I can no longer tell the columns apart (a space between words within a column looks just like the space between columns).
I appreciate that a general solution is very hard, but in this case the columns are so far apart that a simple distinction between "long spaces" and "spaces between words" would be enough.
Is there a way to tell PDFBox to turn horizontal whitespace of more than x inches into something other than a single space? A proportional approach (x inches become y spaces) would also work.
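To make the proportional idea concrete, here is a rough sketch of what I have in mind. It does not use any PDFBox API: the `Fragment` class, the `render` method and the `pointsPerSpace` parameter are all made-up names for illustration, and it assumes I can somehow get each run of text on a line together with its left edge and width in points (1/72 inch), already sorted left to right.

```java
import java.util.ArrayList;
import java.util.List;

public class LongSpaces {

    /** Hypothetical holder for one run of text on a line; units are points (1/72 inch). */
    static final class Fragment {
        final String text;
        final double x;      // left edge of the run
        final double width;  // rendered width of the run

        Fragment(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins the fragments of one line, assumed to be sorted left to right.
     * Each horizontal gap becomes round(gap / pointsPerSpace) spaces, so a
     * wide column gap turns into a long run of spaces instead of just one.
     */
    static String render(List<Fragment> line, double pointsPerSpace) {
        StringBuilder out = new StringBuilder();
        double previousEnd = Double.NaN; // no previous fragment yet
        for (Fragment f : line) {
            if (!Double.isNaN(previousEnd)) {
                double gap = f.x - previousEnd;
                int spaces = (int) Math.round(gap / pointsPerSpace);
                // Overlapping fragments give a negative gap; emit nothing then.
                for (int i = 0; i < spaces; i++) {
                    out.append(' ');
                }
            }
            out.append(f.text);
            previousEnd = f.x + f.width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Fragment> line = new ArrayList<>();
        line.add(new Fragment("John", 72, 24));   // ends at x = 96
        line.add(new Fragment("Smith", 99, 30));  // gap of 3 points: one space
        line.add(new Fragment("42", 288, 12));    // gap of 159 points: 53 spaces
        System.out.println(render(line, 3.0));
    }
}
```

With `pointsPerSpace` set to roughly the width of a space in the document's font, ordinary word gaps stay single spaces while the column gaps become long enough to split on. What I am missing is the PDFBox side: where to hook in to get those positions during extraction.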
© Stack Overflow or respective owner