Transform PDF to HTML, keep layout

Posted by Tgr on Stack Overflow
Published on 2010-05-08T13:36:29Z Indexed on 2010/05/08 13:58 UTC

What methods are there for converting a PDF to HTML? It could be anything: an online service, desktop software, or a library. (Open source is preferred; for a library, PHP or Python would be preferred.) The conversion has to keep the original layout (including page numbers, footnotes and such), keep the images (combining them into a single background image per page is acceptable) and keep the links. It should preferably output valid XHTML and clean up PDF features such as ligatures, but if some post-processing is required, I can live with that. Clean, relatively semantic HTML output would be great.

The closest tool I have found is zamzar.org, but it choked on links. (Also, its HTML output is an ugly heap of absolutely positioned divs and needs post-processing because of encoding problems.)
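
To illustrate the kind of post-processing I could live with, here is a minimal Python sketch of the ligature clean-up. It assumes the converter emits the Unicode Latin ligature characters (U+FB00 to U+FB06) in the HTML text; the function name `expand_ligatures` is made up for this example and is not part of any converter.

```python
# Latin presentation-form ligatures (U+FB00..U+FB06) and their plain spellings.
# Assumption: the converter's output contains these code points verbatim.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature
    "\ufb06": "st",
}

# str.translate accepts a table mapping single characters to replacement strings.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace ligature characters in text with their plain-letter spellings."""
    return text.translate(_LIGATURE_TABLE)


if __name__ == "__main__":
    sample = "<p>An e\ufb03cient work\ufb02ow for every \ufb01le</p>"
    print(expand_ligatures(sample))
```

An explicit table is used rather than `unicodedata.normalize("NFKC", ...)` because NFKC would also rewrite unrelated characters such as superscript footnote markers and non-breaking spaces, which may matter for keeping the layout.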

© Stack Overflow or respective owner
