Transform PDF to HTML, keep layout
- by Tgr
What methods are there to transform a PDF to HTML? It could be anything - online service, software, library. (Opensource preferred. In the last case, php or python would be preferred.) It has to keep the original layout (including page numbers, footnotes and such), keep the images (combining them to one single background image per page is acceptable) and keep the links. It should preferably output valid XHTML and clean up PDF features such as ligatures, but if there is some post-processing required, I can live with that. Something with a clean, relatively semantic HTML output would be great.
The closest one I found was zamzar.org, but it choked on links. (Also, the HTML output is an ugly heap of absolutely positioned divs and needs post-processing because of encoding problems.)