Does JAXP natively parse HTML?

Posted by ikmac on Programmers
Published on 2012-10-03T02:02:15Z Indexed on 2012/10/03 3:50 UTC

So, I whipped up a quick test case in Java 7 to grab a couple of elements from arbitrary URIs and see whether the built-in parsing stuff (JAXP) will do what I need.

Here's the basic setup (with exception handling etc. omitted):

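import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
DocumentBuilderFactory dbfac = DocumentBuilderFactory.newInstance();
DocumentBuilder dbuild = dbfac.newDocumentBuilder();
Document doc = dbuild.parse("uri-goes-here");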

With no error handler installed, the parse method throws exceptions on fatal parse errors.
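To see what the parser actually reports, an ErrorHandler can be installed on the DocumentBuilder. The following is only a minimal sketch: CollectingHandler and problemsIn are hypothetical names made up for illustration, and the input is an in-memory string rather than a URI. Note that recording a fatal error does not make it recoverable; the handler rethrows it and parsing stops there.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class CollectingHandlerDemo {

    /** Hypothetical helper: records every problem the parser reports. */
    static class CollectingHandler implements ErrorHandler {
        final List<String> problems = new ArrayList<>();

        private void record(String level, SAXParseException e) {
            problems.add(level + " at line " + e.getLineNumber() + ": " + e.getMessage());
        }

        @Override public void warning(SAXParseException e) { record("warning", e); }

        @Override public void error(SAXParseException e) { record("error", e); }

        @Override public void fatalError(SAXParseException e) throws SAXException {
            record("fatal", e);
            throw e; // well-formedness errors are not recoverable, so stop here
        }
    }

    /** Parses the markup as XML and returns whatever problems were reported. */
    static List<String> problemsIn(String markup) throws Exception {
        DocumentBuilder dbuild = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        CollectingHandler handler = new CollectingHandler();
        dbuild.setErrorHandler(handler);
        try {
            dbuild.parse(new InputSource(new StringReader(markup)));
        } catch (SAXException e) {
            // Already recorded by the handler; nothing more to do.
        }
        return handler.problems;
    }

    public static void main(String[] args) throws Exception {
        // <b> is never closed, so this is not well-formed XML.
        for (String problem : problemsIn("<a><b></a>")) {
            System.out.println(problem);
        }
    }
}
```

The message text itself comes from the parser implementation and may be localized, so it is better not to match on it.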

When getting the standard Apache 2.2 directory index page from a local server: a SAXParseException with the message 'White spaces are required between publicId and systemId.' The doctype looks OK to me, whitespace and all.

When getting a page off a Drupal 7 generated site: it never finishes. The parse method seems to hang; no exception is thrown and it never returns.

When getting http://www.oracle.com: a SAXParseException with the message 'The element type "meta" must be terminated by the matching end-tag "</meta>".'
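The two exceptions above can be reproduced without any network access. The sketch below makes some assumptions: the markup strings are made up for illustration and are not the actual pages, and the first one assumes the Apache index page uses an SGML-style doctype with a public identifier only (not confirmed here). XML requires a system identifier after the public one, which would explain the first message. The second string has an HTML-style unclosed meta tag. StrictXmlDemo and parseError are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class StrictXmlDemo {

    /** Returns null if the markup parses as XML, otherwise the parser's message. */
    static String parseError(String markup) throws Exception {
        DocumentBuilder dbuild = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            dbuild.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed doctype: public identifier only, which HTML allows but XML does not.
        String sgmlDoctype = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">"
                + "<html><head><title>Index of /</title></head><body></body></html>";
        // HTML-style void element: legal HTML, not well-formed XML.
        String htmlMeta = "<html><head><meta charset=\"utf-8\"></head><body></body></html>";
        // Same document with the element closed XML-style.
        String xhtmlMeta = "<html><head><meta charset=\"utf-8\"/></head><body></body></html>";

        for (String markup : new String[] { sgmlDoctype, htmlMeta, xhtmlMeta }) {
            String error = parseError(markup);
            System.out.println(error == null ? "parsed OK" : "rejected: " + error);
        }
    }
}
```

Only the third string gets through, which fits the impression that the default parser accepts well-formed XML and nothing looser.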


So it would appear that the default setup I've used here doesn't handle HTML, only strictly written XML.
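As for the parse that never returned: one possible cause (a guess, nothing above confirms it) is that the parser tries to download the DTD named in the page's doctype and stalls on that request. A minimal sketch of ruling this out with the standard EntityResolver hook follows. It hands the parser an empty DTD instead of letting it fetch the system identifier. NoDtdFetchDemo and parseWithoutDtdFetch are hypothetical names, and the sample markup is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class NoDtdFetchDemo {

    /** Parses the markup as XML without fetching any external DTD or entity. */
    static Document parseWithoutDtdFetch(String markup) throws Exception {
        DocumentBuilder dbuild = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        dbuild.setEntityResolver(new EntityResolver() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Supply an empty entity instead of opening systemId over the network.
                return new InputSource(new StringReader(""));
            }
        });
        return dbuild.parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        // Illustrative XHTML with an external DTD reference in the doctype.
        String xhtml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>t</title></head><body/></html>";
        Document doc = parseWithoutDtdFetch(xhtml);
        System.out.println(doc.getDocumentElement().getNodeName()); // prints "html"
    }
}
```

This only helps with pages that are already well-formed XHTML; it does nothing for tag-soup HTML such as the unclosed meta case.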

My question is: can JAXP be used out-of-the-box from OpenJDK 7 to parse HTML from the wild (without insane gesticulations), or am I better off looking for an HTML5 parser?

PS this is for something I may not open-source, so licensing is also an issue :(

© Programmers or respective owner
