Proper usage of JTidy to purify HTML
Posted by Raj on Stack Overflow
Published on 2010-03-30T16:49:07Z
Indexed on 2010/03/30 16:53 UTC
Hello, I am trying to use JTidy (jtidy-r938.jar) to sanitize an input HTML string, but I can't seem to get the default settings right. Strings such as "hello world" often come out as "helloworld" after tidying. Here is what I'm doing; any pointers would be really appreciated:
Assume that rawHtml is the String containing the input (real-world) HTML. This is what I'm doing:
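import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import org.w3c.tidy.Tidy;

// Feed the raw HTML to JTidy as UTF-8 bytes
InputStream is = new ByteArrayInputStream(rawHtml.getBytes("UTF-8"));
Tidy tidy = new Tidy();
tidy.setQuiet(true);           // suppress the summary output
tidy.setShowWarnings(false);   // suppress warnings
tidy.setXHTML(true);           // emit XHTML
ByteArrayOutputStream baos = new ByteArrayOutputStream();
tidy.parseDOM(is, baos);       // parse and write the tidied markup to baos
String pure = baos.toString(); // decodes with the platform default charset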
First off, does anything look fundamentally wrong with the above code? I seem to be getting weird results with this.
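One detail that may be worth ruling out, separately from the JTidy settings: the snippet encodes rawHtml as UTF-8 but decodes the result with baos.toString(), which uses the platform default charset. This is probably not what removes the space in "hello world", but it can corrupt non-ASCII characters on some machines. The sketch below shows the same byte round trip with an explicit charset on both ends. The class CharsetRoundTrip and its copy method are hypothetical; copy only stands in for tidy.parseDOM(is, baos) so that the example runs without JTidy on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of JTidy.
class CharsetRoundTrip {

    // Stand-in for tidy.parseDOM(is, baos): copies the bytes through unchanged.
    static void copy(ByteArrayInputStream in, ByteArrayOutputStream out) {
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            out.write(buf, 0, n);
        }
    }

    static String roundTrip(String rawHtml) {
        // Encode the input as UTF-8, as in the original snippet.
        ByteArrayInputStream is =
                new ByteArrayInputStream(rawHtml.getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        copy(is, baos);
        // Decode with the same charset instead of the platform default.
        return new String(baos.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With JTidy in place of copy, the output bytes are only UTF-8 if the Tidy instance is configured to write UTF-8, so the decoding charset should match whatever output encoding is actually in effect.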
Thanks in advance!
© Stack Overflow or respective owner