cleaning up pdftotext font issues
- by mankoff
I'm using pdftotext to make an ASCII version of a PDF document (made with LaTeX), because collaborators prefer a simple document in MS word.
The plain text version I see looks good, but upon closer inspection the f character seems to be frequently mis-converted depending on what characters follow. For example, fi and fl often seem to become one special character, which I will try to paste here: ? and ?.
What is the best way to clean up the output of pdftotext? I am thinking sed might be the right tool, but am not sure how to detect these special characters.