cleaning up pdftotext font issues

Posted by mankoff on Super User
Published on 2010-12-09T23:06:55Z Indexed on 2011/01/11 4:56 UTC

I'm using pdftotext to make an ASCII version of a PDF document (made with LaTeX), because my collaborators prefer a simple document in MS Word.

The plain-text version looks good at first glance, but on closer inspection the f character is frequently mis-converted, depending on which characters follow it. For example, fi and fl often each become a single special character, which I will try to paste here: ﬁ and ﬂ.

What is the best way to clean up the output of pdftotext? I am thinking sed might be the right tool, but am not sure how to detect these special characters.
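
To make the question concrete, here is a minimal Python sketch of the kind of cleanup I have in mind. It assumes the special characters are the Unicode f-ligature code points (U+FB00 to U+FB04) and that the pdftotext output is UTF-8; both are guesses on my part. The function names and the ligature table are my own, not part of pdftotext.

```python
import unicodedata

# Assumed mapping: the f-ligature code points from the Unicode
# "Alphabetic Presentation Forms" block, expanded to plain ASCII.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# Translation table built once; str.translate accepts multi-character targets.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace each known ligature character with its ASCII letters."""
    return text.translate(_TABLE)


def find_non_ascii(text):
    """Return the distinct non-ASCII characters in text with their Unicode names.

    Useful for spotting which special characters are actually present
    before deciding what to replace.
    """
    found = {ch for ch in text if ord(ch) > 127}
    return sorted((ch, unicodedata.name(ch, "UNKNOWN")) for ch in found)
```

Running find_non_ascii on the converted text first would show which special characters are really there; expand_ligatures then rewrites only the ones listed in the table, so anything unexpected is left alone for manual inspection. The same mapping could presumably be written as a handful of sed substitutions, provided sed is run in a UTF-8 locale.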

© Super User or respective owner
