Does JAXP natively parse HTML?
- by ikmac
So, I whipped up a quick test case in Java 7 to grab a couple of elements from random URIs and see whether the built-in parsing stuff will do what I need.
Here's the basic setup (with exception handling etc. omitted):
DocumentBuilderFactory dbfac = DocumentBuilderFactory.newInstance();
DocumentBuilder dbuild = dbfac.newDocumentBuilder();
Document doc = dbuild.parse("uri-goes-here");
With no error handler installed, the parse method throws exceptions on fatal parse errors.
When getting the standard Apache 2.2 directory index page from a local server: a SAXParseException with the message "White spaces are required between publicId and systemId." The doctype looks OK to me, whitespace and all.
When getting a page from a Drupal 7 generated site: the parse method seems to hang. No exception is thrown and it never returns.
When getting http://www.oracle.com: a SAXParseException with the message "The element type "meta" must be terminated by the matching end-tag "</meta>"."
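To narrow down where failures like these come from, an error handler can be installed on the builder. The following is only a sketch: the class name CollectingHandlerDemo, the nested Collector, and the helper problemsIn are made up for illustration, and it parses an in-memory string rather than a URI so it runs without a network. It records each reported problem with its line and column, and rethrows fatal errors so parsing still stops there.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class CollectingHandlerDemo {

    // Hypothetical handler: remembers every problem and rethrows fatal ones.
    static class Collector implements ErrorHandler {
        final List<String> messages = new ArrayList<String>();

        private void record(String level, SAXParseException e) {
            messages.add(level + " at line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber() + ": " + e.getMessage());
        }

        @Override
        public void warning(SAXParseException e) {
            record("warning", e);
        }

        @Override
        public void error(SAXParseException e) {
            record("error", e);
        }

        @Override
        public void fatalError(SAXParseException e) throws SAXParseException {
            record("fatal", e);
            throw e; // stop parsing at the first fatal error
        }
    }

    // Hypothetical helper: parses the markup and returns whatever was reported.
    public static List<String> problemsIn(String markup) throws Exception {
        DocumentBuilderFactory dbfac = DocumentBuilderFactory.newInstance();
        DocumentBuilder dbuild = dbfac.newDocumentBuilder();
        Collector collector = new Collector();
        dbuild.setErrorHandler(collector);
        try {
            dbuild.parse(new InputSource(new StringReader(markup)));
        } catch (SAXException e) {
            // Already recorded by the handler; nothing more to do here.
        }
        return collector.messages;
    }

    public static void main(String[] args) throws Exception {
        // The unclosed <b> makes this document not well-formed.
        for (String problem : problemsIn("<a><b></a>")) {
            System.out.println(problem);
        }
    }
}
```

The line and column in each message should at least show which part of the input the parser objected to.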
So it would appear that the default setup I've used here doesn't handle HTML, only strictly written XML.
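A small offline check seems to bear this out. This is a minimal sketch, assuming the same default factory setup as above; the class JaxpHtmlCheck, the helper tryParse, and the two sample strings are invented for the test. It feeds the parser one XML-style document with a self-closed meta element and one HTML-style document with a bare meta tag.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class JaxpHtmlCheck {

    // Hypothetical helper: returns null if the markup parses,
    // otherwise the message of the exception the parser threw.
    public static String tryParse(String markup)
            throws ParserConfigurationException, IOException {
        DocumentBuilderFactory dbfac = DocumentBuilderFactory.newInstance();
        DocumentBuilder dbuild = dbfac.newDocumentBuilder();
        try {
            dbuild.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXException e) {
            return String.valueOf(e.getMessage());
        }
    }

    public static void main(String[] args) throws Exception {
        // Well-formed XML: every element is closed.
        String xmlStyle =
                "<html><head><meta charset=\"utf-8\"/></head><body/></html>";
        // Ordinary HTML: meta is a void element with no end tag.
        String htmlStyle =
                "<html><head><meta charset=\"utf-8\"></head><body></body></html>";

        System.out.println("XML-style:  " + tryParse(xmlStyle));
        System.out.println("HTML-style: " + tryParse(htmlStyle));
    }
}
```

The first document should parse and the second should be rejected, which matches the "meta" failure above.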
My question is: can JAXP be used out-of-the-box from OpenJDK 7 to parse HTML from the wild (without insane gesticulations), or am I better off looking for an HTML5 parser?
PS: this is for something I may not open-source, so licensing is also an issue :(