I'm using CyberNeko to clean and process HTML documents.
I need to be able to process all the comments that occur in the original HTML documents.
I've configured the CyberNeko SAX parser to report comments like so:
parser.setProperty("http://xml.org/sax/properties/lexical-handler", consumer);
...using the same consumer that I use for the DOM events.
I get a callback for each of the comments:
@Override
public void comment(char[] ch, int start, int length) throws SAXException {
    System.out.println("COMMENT::: " + new String(ch, start, length));
}
The problem is that all the comments are processed first, out of context of the DOM, i.e. I get a callback for every comment before the document head, body, etc.
What I'd like is for the comment callbacks to arrive in the order the comments occur in the DOM.
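For comparison, here is a minimal sketch of the ordering I'd expect. It uses the JDK's built-in SAX parser on well-formed XML rather than CyberNeko, so it only illustrates the desired behaviour; the class names `CommentOrderDemo` and `RecordingHandler` are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

public class CommentOrderDemo {

    /** Hypothetical handler: records element and comment events in arrival order. */
    static class RecordingHandler extends DefaultHandler2 {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("START " + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("END " + qName);
        }

        @Override
        public void comment(char[] ch, int start, int length) {
            events.add("COMMENT " + new String(ch, start, length));
        }
    }

    static List<String> parse(String xml) throws Exception {
        // Assumption: the JDK's default SAX parser, not CyberNeko.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        RecordingHandler handler = new RecordingHandler();
        // The same object receives both content events and lexical (comment) events.
        reader.setContentHandler(handler);
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><head><!--a--></head><body><!--b--></body></html>";
        for (String event : parse(xml)) {
            System.out.println(event);
        }
    }
}
```

With this setup each comment event arrives between the start and end events of the element that contains it, which is the interleaving I'm after.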
Edit: what I'm actually trying to do is preserve the conditional comments for IE found in the original HTML, such as:
<!--[if lte IE 6]><body class="news ie"><![endif]-->
At the moment they are all dropped; I need to include them in the cleaned HTML document.
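To make the goal concrete, here is a rough sketch of what "including them" would look like once the callbacks arrive in order: the comment text handed to `comment(...)` is wrapped in `<!--` and `-->` again and written at the current position in the output. It again assumes the JDK's SAX parser and well-formed input instead of CyberNeko, and `ConditionalCommentEcho` and `EchoHandler` are hypothetical names. Attributes and escaping are deliberately left out.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

public class ConditionalCommentEcho {

    /** Hypothetical handler: writes elements, text and comments as they are reported. */
    static class EchoHandler extends DefaultHandler2 {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Attributes are ignored to keep the sketch short.
            out.append('<').append(qName).append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // No escaping of '&' or '<' is done here.
            out.append(ch, start, length);
        }

        @Override
        public void comment(char[] ch, int start, int length) {
            // Restore the delimiters the parser stripped from the comment.
            out.append("<!--").append(ch, start, length).append("-->");
        }
    }

    static String echo(String xml) throws Exception {
        // Assumption: the JDK's default SAX parser, not CyberNeko.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        EchoHandler handler = new EchoHandler();
        reader.setContentHandler(handler);
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><!--[if lte IE 6]><body class=\"news ie\"><![endif]-->"
                + "<body>hi</body></html>";
        System.out.println(echo(xml));
    }
}
```

The markup inside the conditional comment is never parsed; it is passed through as opaque comment text, which is what I want for the cleaned document.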