Programmatically clean Word-generated HTML while preserving styles?

Posted by GeReV on Stack Overflow
Published on 2010-05-10T21:46:40Z

In my current company, we have this decade-old... let's call it a "Hello World" application.

While wanting to create a newer version of it, we also want to preserve older entries.
These older entries contain hideous Word-generated HTML which was never filtered before.

If and when we move to a newer system, I'd generally prefer to have that HTML cleaned and filtered in order to have the site comply with HTML standards as much as possible.
However, just cleaning that code the way Jeff Atwood described on his blog, or in any other way I know of, would also strip the styling and formatting.

Now, that just might cause our users to revolt and then all hell will break loose... Not a very good idea.

The question is: can Word's HTML be cleaned while preserving basic formatting (e.g., colors, italics, bold text, and so on)?

Preferably using publicly available code or a library, such as HTML Tidy. Examples in C# would be much appreciated.
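
To make the goal concrete, here is a rough sketch of the kind of whitelist filtering this would involve. It is written in Python with only the standard library, rather than C#, purely to keep it short. The tag and CSS whitelists and the names `clean_word_html` and `WordHtmlCleaner` are assumptions made up for illustration, not part of HTML Tidy or any existing library. The idea is to keep a small set of formatting tags, keep only a few CSS properties from inline `style` attributes, and drop everything else (classes, `mso-*` properties, `<o:p>` tags, comments, embedded style blocks).

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelists: adjust to whatever formatting must survive.
KEEP_TAGS = {"p", "br", "b", "strong", "i", "em", "u", "span", "ul", "ol", "li"}
KEEP_CSS = {"color", "background-color", "font-weight", "font-style", "text-decoration"}
SKIP_CONTENT = {"style", "script", "title", "xml"}  # drop these tags and their contents
VOID_TAGS = {"br"}  # tags that never get a closing tag


def filter_style(style):
    """Keep only whitelisted CSS declarations (drops mso-* and layout properties)."""
    kept = []
    for decl in style.split(";"):
        name, sep, value = decl.partition(":")
        name, value = name.strip().lower(), value.strip()
        if sep and value and name in KEEP_CSS:
            kept.append(f"{name}: {value}")
    return "; ".join(kept)


class WordHtmlCleaner(HTMLParser):
    """Re-emits only whitelisted tags; comments are dropped by the base class."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a SKIP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in KEEP_TAGS:
            return  # unknown tag: drop the tag itself, keep its text
        style = filter_style(dict(attrs).get("style") or "")
        attr_text = f' style="{escape(style)}"' if style else ""
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in KEEP_TAGS or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean_word_html(html):
    parser = WordHtmlCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    dirty = ('<p class=MsoNormal style="margin:0cm;mso-pagination:none">'
             '<span style="color:red;mso-bidi-font-weight:bold"><b>Hi</b>'
             '<o:p></o:p></span></p>')
    print(clean_word_html(dirty))
```

This does not repair badly nested markup and it leaves empty `<span>` wrappers behind, so in practice it would probably sit before or after a pass through something like HTML Tidy rather than replace it.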

Thanks!

© Stack Overflow or respective owner
