Ideal Java library for cleaning HTML and escaping malformed fragments
Posted
by Tyler
on Stack Overflow
Published on 2010-03-01T19:12:27Z
Indexed on
2010/04/16
10:13 UTC
I've got some HTML files that need to be parsed and cleaned. They occasionally contain special characters such as <, >, and " that have not been properly escaped.
I have tried running the files through JTidy, but the best I can get it to do is omit the content it sees as malformed HTML. Is there a different library that will escape the malformed fragments instead of omitting them? If not, which library would be easiest to modify to do that?
Clarification:
Sample input: <p> blah blah <M+1> blah </p>
Desired output: <p> blah blah &lt;M+1&gt; blah </p>
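To make the requested behaviour concrete, here is a minimal sketch of a pre-processing pass that could run before a cleaner such as JTidy. It is not a feature of JTidy or of any existing library. The class name StrayMarkupEscaper, the method escapeStrayMarkup, and the tag whitelist are all hypothetical, and the pass assumes plain fragments: comments, doctype declarations, CDATA sections, and attribute values containing > are not handled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: escapes angle brackets that do not belong to a known tag.
final class StrayMarkupEscaper {

    // Assumed whitelist; extend it to cover the tags the files actually use.
    private static final Set<String> KNOWN_TAGS = new HashSet<>(Arrays.asList(
            "html", "head", "body", "p", "div", "span", "a", "b", "i",
            "br", "ul", "ol", "li", "table", "tr", "td", "th"));

    // A tag-like token: '<', optional '/', a name, optional attributes
    // introduced by whitespace, optional '/', then '>'.
    private static final Pattern TAG_LIKE =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)(?:\\s[^<>]*)?/?>");

    static String escapeStrayMarkup(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG_LIKE.matcher(html);
        int last = 0;
        while (m.find()) {
            // Text between tag-like tokens: escape any loose '<' or '>'.
            out.append(escapeText(html.substring(last, m.start())));
            String name = m.group(1).toLowerCase(Locale.ROOT);
            if (KNOWN_TAGS.contains(name)) {
                out.append(m.group());             // real tag, keep verbatim
            } else {
                out.append(escapeText(m.group())); // unknown tag, escape it
            }
            last = m.end();
        }
        out.append(escapeText(html.substring(last)));
        return out.toString();
    }

    // Ampersands and quotes are left alone so existing entities are not
    // double-escaped; only angle brackets are rewritten.
    private static String escapeText(String s) {
        return s.replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Calling StrayMarkupEscaper.escapeStrayMarkup on the sample input leaves the p tags alone and escapes the M+1 fragment, after which the result could be handed to the cleaner. A whitelist is used because a fragment like the one in the sample is syntactically indistinguishable from an unknown tag, so some list of expected tag names is needed to tell them apart.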
© Stack Overflow or respective owner