Unreadable PDF files
- by Michal_R
Hello,
I am writing my Master's thesis on an NLP system. One of its components is an extractor that pulls plain text out of PDF files. For a few PDF files the text cannot be extracted correctly. The extractor (the PDFBox library) returns a string like this:
"¦xDn¦if|d+gDF"Ti&cD+lh d FÁhis~n +xd f«"d¦ffih »h"
or
"10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17"
I checked each file that causes this extraction problem, and in all of them the text cannot be copy-pasted from a PDF reader either (I tried Adobe Reader and Foxit Reader). The files display correctly in these readers, but after selecting the content and copying it to the clipboard I get the same wrong text as described above: strings of meaningless characters, or strings of digits and letters.
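A rough way to flag such files automatically is to check how many whitespace-separated tokens in the extracted string look like ordinary words. The sketch below is only a heuristic under stated assumptions and is not part of PDFBox: the function name `looks_garbled` and the 0.6 threshold are hypothetical choices, and it simply treats a token as word-like when it is purely alphabetic after surrounding ASCII punctuation is stripped.

```python
import string


def looks_garbled(text, min_word_ratio=0.6):
    """Return True if too few tokens in `text` look like ordinary words.

    Hypothetical heuristic: a token counts as word-like when, after
    stripping leading/trailing ASCII punctuation, it is non-empty and
    consists only of alphabetic characters.
    """
    tokens = text.split()
    if not tokens:
        # Nothing was extracted at all, so treat the result as unusable.
        return True

    word_like = 0
    for token in tokens:
        core = token.strip(string.punctuation)
        if core and core.isalpha():
            word_like += 1

    # Flag the text when the share of word-like tokens is below the threshold.
    return word_like / len(tokens) < min_word_ratio
```

Applied to the extractor output for each file, this would mark both example strings above as garbled while leaving normal sentences alone. The threshold is a guess and would need tuning on real documents, especially ones with many numbers or formulas.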
Could anybody help me? Thanks :)