Programmatically clean Word generated HTML while preserving styles?
Posted by GeReV on Stack Overflow
Published on 2010-05-10T21:46:40Z
Indexed on 2010/05/14 20:44 UTC
At my current company, we have this decade-old... let's call it a "Hello World" application.
While wanting to create a newer version of it, we also want to preserve older entries.
These older entries contain hideous Word-generated HTML, which was never filtered.
If and when we move to a newer system, I'd prefer to have that HTML cleaned and filtered so that the site complies with HTML standards as much as possible.
However, simply cleaning that code the way Jeff Atwood described on his blog, or in any other way I know of, would also ruin the style and formatting.
Now, that just might cause our users to revolt and then all hell will break loose... Not a very good idea.
The question is: can Word's HTML be cleaned while preserving basic formatting (e.g. coloring, italics, bold text and so on)?
Preferably using publicly available code or a library such as HTML Tidy. Examples in C# would be much appreciated.
Thanks!
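
To make the requirement concrete, here is a minimal sketch of one possible approach: a whitelist filter that keeps a few formatting tags and a few inline style properties and drops everything else. It is not HTML Tidy and not C#; it is written in Python with only the standard library so it can be run as-is, and the same idea should port to C#. The names `ALLOWED_TAGS`, `ALLOWED_STYLE_PROPS`, `WordHtmlCleaner` and `clean_word_html` are hypothetical, and the whitelists are an assumption about what "basic formatting" means. It does not validate style values, so it is not a security sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelists: adjust to whatever formatting must survive.
ALLOWED_TAGS = {"p", "br", "b", "strong", "i", "em", "u", "span", "ul", "ol", "li"}
VOID_TAGS = {"br"}                        # tags that never get a closing tag
SKIP_CONTENT = {"style", "script", "xml"}  # drop these tags and their contents
ALLOWED_STYLE_PROPS = {
    "color", "background-color", "font-weight", "font-style", "text-decoration",
}


def filter_style(value):
    """Keep only whitelisted CSS declarations (drops mso-*, margins, etc.)."""
    kept = []
    for decl in value.split(";"):
        prop, sep, val = decl.partition(":")
        prop, val = prop.strip().lower(), val.strip()
        if sep and val and prop in ALLOWED_STYLE_PROPS:
            kept.append(f"{prop}: {val}")
    return "; ".join(kept)


class WordHtmlCleaner(HTMLParser):
    """Re-emits whitelisted tags; unknown tags vanish but their text stays."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <style>, <script> or <xml>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        # Every attribute except a filtered "style" is dropped (class, lang, ...).
        style = filter_style(dict(attrs).get("style") or "")
        attr_text = f' style="{escape(style)}"' if style else ""
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED_TAGS or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

    # Comments (including Word's conditional comments) are dropped because
    # handle_comment is left at its default no-op implementation.


def clean_word_html(html):
    cleaner = WordHtmlCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)
```

For example, `clean_word_html('<p class="MsoNormal" style="margin-left:36pt; color:red"><b>Hi</b></p>')` keeps the paragraph, the bold tag and the red color, and discards the `class` attribute and the margin. Output produced this way could still be passed through HTML Tidy afterwards to repair nesting.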
© Stack Overflow or respective owner