Parsing HTML with XPath and PHP

Posted by Peter on Stack Overflow
Published on 2011-01-04T09:31:30Z Indexed on 2011/01/04 9:54 UTC

Is there a way (using XPath and PHP) to do the following (WITHOUT external XSLT files)?

  • Remove all tables and their contents
  • Remove everything after the first h1 tag
  • Keep only paragraphs, INCLUDING their inner HTML (links, lists, etc.)

I received an XSLT answer here, but I'm looking for XPath queries that don't require external files.

Currently, I've got the HTML in question loaded into a SimpleXMLElement via:

$doc = new DOMDocument();
@$doc->loadHTML($xml); // loadHTML() is an instance method; @ still suppresses parser warnings
$data = simplexml_import_dom($doc);

Now I need help with:

$data = $data->xpath('??????');

I've been working on this for several days to no avail. I'd really appreciate any help.

Edit: I don't particularly care what's inside the paragraphs, as I can use strip_tags to eliminate what I don't want. All I need to do is to isolate the paragraphs from the rest of the source. I suppose a more specific, accurate requirement would be this:

Return only paragraphs (and their HTML contents) that aren't contained in tables and that appear before the first h1 tag.
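
One way this requirement might be expressed is with two XPath predicates: one that rejects any paragraph with a table ancestor, and one that rejects any paragraph with an h1 earlier in document order. The following is a minimal sketch, not a tested answer to the question; the sample markup in $xml is hypothetical, and libxml_use_internal_errors() is used here in place of the @ operator to silence parser warnings.

```php
<?php
// Hypothetical sample input (not from the original question).
$xml = '<html><body>'
     . '<p>Intro with a <a href="#">link</a>.</p>'
     . '<table><tr><td><p>Inside a table</p></td></tr></table>'
     . '<p>Second paragraph</p>'
     . '<h1>Heading</h1>'
     . '<p>After the heading</p>'
     . '</body></html>';

// Collect HTML parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($xml);
libxml_clear_errors();

$data = simplexml_import_dom($doc);

// not(ancestor::table): the paragraph is not nested inside any table.
// not(preceding::h1):   no h1 occurs before the paragraph in document order.
$paragraphs = $data->xpath('//p[not(ancestor::table)][not(preceding::h1)]');

// asXML() serializes each paragraph together with its inner markup.
$html = array();
foreach ($paragraphs as $p) {
    $html[] = $p->asXML();
}

print_r($html);
```

Each entry in $html should then be a paragraph serialized with its inner markup, so strip_tags can be applied afterwards to anything unwanted. Note that the preceding axis follows document order, so "before the first h1" here means before the first h1 anywhere in the source, not just among siblings.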

© Stack Overflow or respective owner