Proper usage of JTidy to purify HTML
- by Raj
Hello,
I am trying to use JTidy (jtidy-r938.jar) to sanitize an input HTML string, but I can't seem to get the default settings right. Whitespace between words is often lost: strings such as "hello world" end up as "helloworld" after tidying. Here is what I'm doing; any pointers would be really appreciated:
Assume that rawHtml is the String containing the input (real world) HTML. This is what I'm doing:
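import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import org.w3c.tidy.Tidy;

// Feed the raw HTML to JTidy as UTF-8 bytes
InputStream is = new ByteArrayInputStream(rawHtml.getBytes("UTF-8"));
Tidy tidy = new Tidy();
tidy.setInputEncoding("UTF-8");   // match the bytes produced above
tidy.setOutputEncoding("UTF-8");  // match the decoding below
tidy.setQuiet(true);
tidy.setShowWarnings(false);
tidy.setXHTML(true);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
tidy.parseDOM(is, baos);
String pure = baos.toString("UTF-8");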
First off, does anything look fundamentally wrong with the above code? The output keeps losing whitespace, as in the "helloworld" example.
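One thing I would like to rule out is the decoding step at the end. The input goes in as UTF-8 bytes, so the output buffer should presumably be decoded as UTF-8 too, rather than with whatever charset the platform defaults to. Below is a minimal stand-alone sketch of that round trip. It does not involve JTidy at all, and the class and method names (EncodingRoundTrip, writeUtf8, decodeUtf8, decodeLatin1) are made up purely for illustration. I doubt this explains the missing spaces, but it is one variable I can eliminate:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of JTidy: it only illustrates the
// byte/charset round trip around a ByteArrayOutputStream.
class EncodingRoundTrip {

    // Stand-in for the tidied output: puts the UTF-8 bytes of the text
    // into a buffer, the way a UTF-8 writer would.
    public static ByteArrayOutputStream writeUtf8(String text) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        baos.write(text.getBytes(StandardCharsets.UTF_8));
        return baos;
    }

    // Decodes the buffer with the same charset that was used to write it.
    public static String decodeUtf8(ByteArrayOutputStream baos) throws IOException {
        return baos.toString("UTF-8");
    }

    // Decodes the buffer with a different charset, simulating a platform
    // whose default charset is not UTF-8.
    public static String decodeLatin1(ByteArrayOutputStream baos) throws IOException {
        return baos.toString("ISO-8859-1");
    }

    public static void main(String[] args) throws IOException {
        String html = "<p>caf\u00e9 world</p>";
        ByteArrayOutputStream baos = writeUtf8(html);
        System.out.println(html.equals(decodeUtf8(baos)));   // prints true
        System.out.println(html.equals(decodeLatin1(baos))); // prints false
    }
}
```

With pure ASCII input both decodings give the same string, so a mismatch here would only show up on non-ASCII characters.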
Thanks in advance!