Preserve "long" spaces in PDFBox text extraction
- by Thilo
I am using PDFBox to extract text from PDF.
The PDF has a tabular structure, which is quite simple and columns are also very widely spaced from each-other
This works really well, except that all kinds of horizontal space gets converted into a single space character, so that I cannot tell columns apart anymore (space within words in a column looks just like space between columns).
I appreciate that a general solution is very hard, but in this case the columns are really far apart so that having a simple differentiation between "long spaces" and "space between words" would be enough.
Is there a way to tell PDFBox to turn horizontal whitespace of more then x inches into something other than a single space? A proportional approach (x inch become y spaces) would also work.