Vote on Pros and Cons of Java HTML to XML cleaners
Posted by George Bailey on Stack Overflow
Published on 2010-12-21T16:44:47Z
I want to allow HTML emails (and other HTML uploads) without letting in scripts or other active content. I plan to use a whitelist of safe tags and attributes, plus a whitelist of CSS properties with a regex for each allowed value (to prevent things like automatic return receipts).
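To make the CSS part of that plan concrete, here is a minimal sketch of a property whitelist with per-value regexes, using only the JDK. The class name `StyleWhitelist`, the three properties, and their patterns are assumptions chosen for illustration, not a vetted safe list; a real list would need careful review.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

final class StyleWhitelist {
    // Hypothetical whitelist: CSS property name -> regex the whole value must match.
    private static final Map<String, Pattern> ALLOWED = new LinkedHashMap<>();
    static {
        ALLOWED.put("color", Pattern.compile("#(?:[0-9a-fA-F]{3}){1,2}|[a-zA-Z]+"));
        ALLOWED.put("font-weight", Pattern.compile("normal|bold|[1-9]00"));
        ALLOWED.put("text-align", Pattern.compile("left|right|center|justify"));
    }

    /** Returns only the declarations whose property and value both pass the whitelist. */
    static String filter(String style) {
        if (style == null) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon < 0) {
                continue; // not a "property: value" pair, drop it
            }
            String property = declaration.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = declaration.substring(colon + 1).trim();
            Pattern allowed = ALLOWED.get(property);
            // Anything not explicitly allowed (url(...), expression(...), unknown
            // properties) is silently dropped rather than reported as an error.
            if (allowed != null && allowed.matcher(value).matches()) {
                if (out.length() > 0) {
                    out.append("; ");
                }
                out.append(property).append(": ").append(value);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(filter("color: red; background: url(http://tracker.example/x.gif)"));
    }
}
```

Because the filter rebuilds the `style` value from accepted declarations only, a value that references a remote URL never reaches the output, which is the kind of thing that could otherwise act as a read receipt.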
I asked a question: Parse a badly formatted XML document (like an HTML file)
I found that there are many ways to do this. Some systems have built-in sanitizers (which I am less interested in). This page is a very nice listing, but I get a bit lost in it: http://java-source.net/open-source/html-parsers
It is very important that the parsers never throw an exception. There should always be best guess results to the parse/clean. It is also very important that the result is valid XML that can be traversed in Java.
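As a rough illustration of why well-formed XML output matters, the sketch below walks a document with the JDK's DOM API and applies a tag and attribute whitelist. It assumes the cleaner has already turned the HTML into well-formed XML; it does no tolerant parsing itself, so `parse` here will fail on malformed input. The class name `TagWhitelist` and the allowed tags are hypothetical.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TagWhitelist {
    // Hypothetical whitelist: tag name -> attribute names allowed on that tag.
    private static final Map<String, Set<String>> ALLOWED = new HashMap<>();
    static {
        ALLOWED.put("div", new HashSet<>());
        ALLOWED.put("p", new HashSet<>());
        ALLOWED.put("b", new HashSet<>());
        ALLOWED.put("a", new HashSet<>(Arrays.asList("href")));
    }

    /** Parses already well-formed XML; wraps any parser failure in an unchecked exception. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so external entities cannot be pulled in.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /** Removes non-whitelisted descendant elements and attributes; the root itself is not checked. */
    static void clean(Element parent) {
        NodeList children = parent.getChildNodes();
        // Iterate backwards because the NodeList is live and shrinks on removal.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // text, comments, etc. are left untouched in this sketch
            }
            Element element = (Element) child;
            Set<String> allowedAttrs = ALLOWED.get(element.getTagName().toLowerCase(Locale.ROOT));
            if (allowedAttrs == null) {
                parent.removeChild(element); // drops the element together with its contents
                continue;
            }
            NamedNodeMap attrs = element.getAttributes();
            for (int a = attrs.getLength() - 1; a >= 0; a--) {
                String name = attrs.item(a).getNodeName();
                if (!allowedAttrs.contains(name.toLowerCase(Locale.ROOT))) {
                    element.removeAttribute(name);
                }
            }
            clean(element);
        }
    }
}
```

A caller would run something like `Document doc = TagWhitelist.parse(cleanedXml); TagWhitelist.clean(doc.getDocumentElement());`. Attribute values are not inspected here, so an `href` carrying a `javascript:` URL would still pass; a value check in the spirit of the CSS regexes would be needed on top.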
I posted some product information as answers and marked them Community Wiki. Please post any other product suggestions you like, also as Community Wiki, so they can be voted on.
Any comments or wiki edits on where a particular product is stronger or weaker (for example, speed vs. accuracy) would also be greatly appreciated.
It seems that we will go with either jsoup (which seems more active and up to date) or TagSoup (which is compatible with JDK4 and has been around a while).
A bonus for any of these products would be the ability to convert all style sheets into inline styles on the elements.
© Stack Overflow or respective owner