cleaning up pdftotext font issues
Posted by mankoff on Super User
Published on 2010-12-09T23:06:55Z
I'm using pdftotext to make an ASCII version of a PDF document (made with LaTeX), because my collaborators prefer a simple document in MS Word.
The plain-text version looks good at first glance, but on closer inspection the f character is frequently mis-converted, depending on which characters follow it. For example, fi and fl often become a single special character (a ligature), which I will try to paste here: ﬁ and ﬂ.
What is the best way to clean up the output of pdftotext? I am thinking sed might be the right tool, but I am not sure how to detect these special characters.
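To make the question concrete, here is a minimal sketch of the kind of cleanup I have in mind, written in Python rather than sed. It assumes the pdftotext output is UTF-8 and that the odd characters are the Unicode Latin ligatures (U+FB00 to U+FB06); I have not confirmed that this is what pdftotext emits in every case. The function names `non_ascii_chars` and `expand_ligatures` are made up for illustration.

```python
# Assumption: the special characters are the Unicode Latin ligatures
# U+FB00..U+FB06. Each one is mapped back to its plain-letter spelling.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature
    "\ufb06": "st",
}
_TABLE = str.maketrans(LIGATURES)


def non_ascii_chars(text: str) -> list[str]:
    """Detection step: return the distinct non-ASCII characters in text, sorted."""
    return sorted({ch for ch in text if ord(ch) > 127})


def expand_ligatures(text: str) -> str:
    """Cleanup step: replace each ligature character with its plain letters."""
    return text.translate(_TABLE)


# Hypothetical usage on a file produced by pdftotext:
#   with open("paper.txt", encoding="utf-8") as f:
#       text = f.read()
#   print(non_ascii_chars(text))   # see which special characters are present
#   clean = expand_ligatures(text)
```

The explicit table only touches the ligatures. `unicodedata.normalize("NFKC", text)` from the standard library would also expand them, but it rewrites other compatibility characters too, which may or may not be wanted.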
© Super User or respective owner