Convert doc/docx to semantic HTML

Posted by sandstrom on Stack Overflow
Published on 2009-08-26T15:06:56Z Indexed on 2010/03/16 9:26 UTC

I would like to convert doc/docx documents to semantic HTML.

Some wishes/requirements:

  1. Semantic HTML, such that headings in the document become <h1>, <h2> etc., tables become <table>, and so forth.

  2. It should preferably handle headings, lists, tables and images. Graphs and math formulas would be a nice extra.

  3. The conversion doesn't have to go straight from doc/docx to HTML; it could use an intermediate format, such as XML or DocBook.

  4. It should work programmatically, and with a large number of documents.
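To illustrate the kind of mapping meant in the first requirement, here is a minimal sketch in Python using only the standard library. It covers docx only (not the legacy doc format) and relies on the fact that a docx file is a zip archive whose main content lives in word/document.xml. It assumes headings use the default English style ids "Heading1", "Heading2" and so on, which localized or custom templates may not follow. The function names are made up for this example, and lists, images, graphs and formulas are not handled.

```python
import html
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraph_text(p):
    """Concatenate the text of all runs (w:t elements) in a paragraph."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))


def paragraph_to_html(p):
    """Map a w:p element to <hN> or <p>, based on its paragraph style id."""
    style = p.find("w:pPr/w:pStyle", NS)
    style_id = style.get(f"{{{W}}}val", "") if style is not None else ""
    text = html.escape(paragraph_text(p))
    # Assumption: headings carry the default style ids "Heading1".."Heading9".
    if style_id.startswith("Heading") and style_id[7:].isdigit():
        level = max(1, min(int(style_id[7:]), 6))  # HTML only has h1..h6
        return f"<h{level}>{text}</h{level}>"
    return f"<p>{text}</p>"


def table_to_html(tbl):
    """Map a w:tbl element to a plain <table> with one <td> per cell."""
    rows = []
    for tr in tbl.findall("w:tr", NS):
        cells = []
        for tc in tr.findall("w:tc", NS):
            cell_text = " ".join(paragraph_text(p) for p in tc.findall("w:p", NS))
            cells.append(f"<td>{html.escape(cell_text)}</td>")
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>" + "".join(rows) + "</table>"


def document_xml_to_html(xml_bytes):
    """Convert the raw bytes of word/document.xml to an HTML fragment."""
    body = ET.fromstring(xml_bytes).find("w:body", NS)
    out = []
    for child in body:
        if child.tag == f"{{{W}}}p":
            if paragraph_text(child).strip():  # skip empty paragraphs
                out.append(paragraph_to_html(child))
        elif child.tag == f"{{{W}}}tbl":
            out.append(table_to_html(child))
    return "\n".join(out)


def docx_to_html(path_or_file):
    """Open a .docx (path or file-like object) and convert its main part."""
    with zipfile.ZipFile(path_or_file) as z:
        return document_xml_to_html(z.read("word/document.xml"))
```

Since docx_to_html takes one path at a time, running it over a large batch is just a loop over the files, for example `for path in pathlib.Path("docs").glob("*.docx")` (the directory name here is hypothetical).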

The closest thing to a solution I've found so far is http://holloway.co.nz/docvert/index.html, but unfortunately it has quite a few bugs and a small user base, and it can't handle a lot of documents. It's more of a proof of concept.

© Stack Overflow or respective owner
