Ideal Java library for cleaning HTML and escaping malformed fragments

Posted by Tyler on Stack Overflow
Published on 2010-03-01T19:12:27Z Indexed on 2010/04/16 10:13 UTC

I've got some HTML files that need to be parsed and cleaned, and they occasionally contain special characters such as <, >, and " that have not been properly escaped.

I have tried running the files through JTidy, but the best I can get it to do is omit the content it sees as malformed HTML. Is there a different library that will escape the malformed fragments instead of omitting them? If not, which library would be easiest to modify to do this?

Clarification:

Sample input: <p> blah blah <M+1> blah </p>

Desired output: <p> blah blah &lt;M+1&gt; blah </p>
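
To make the desired behavior concrete, here is a rough sketch of a pre-processing pass that could run before the cleaner. It is not taken from JTidy or any other library: the class name FragmentEscaper, the method escapeMalformed, and the regular expression are all hypothetical. It assumes that anything shaped like <name>, </name>, <name attr...>, or <name/> is a real tag, and that any other < or > is stray text to be escaped. Quotes and ampersands are left untouched so that existing entities are not double-escaped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FragmentEscaper {
    // Hypothetical heuristic: a "tag" is '<', an optional '/', a name made of
    // letters and digits that starts with a letter, optional attributes that
    // begin with whitespace, an optional '/', then '>'.
    private static final Pattern TAG =
        Pattern.compile("</?[A-Za-z][A-Za-z0-9]*(?:\\s[^<>]*)?/?>");

    // Copies tag-shaped tokens through unchanged and escapes '<' and '>'
    // in everything that lies between them.
    public static String escapeMalformed(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            out.append(escapeText(html.substring(last, m.start())));
            out.append(m.group());
            last = m.end();
        }
        out.append(escapeText(html.substring(last)));
        return out.toString();
    }

    // Escapes only angle brackets; '&' and '"' are deliberately left alone.
    private static String escapeText(String text) {
        return text.replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // prints: <p> blah blah &lt;M+1&gt; blah </p>
        System.out.println(escapeMalformed("<p> blah blah <M+1> blah </p>"));
    }
}
```

This is only a heuristic: stray text that happens to look like a tag (for example <b y>) would still be passed through as markup, so checking the tag name against a whitelist of known HTML elements would probably be needed for real input.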

© Stack Overflow or respective owner
