Parsing HTML with XPath and PHP
Posted by Peter on Stack Overflow
Published on 2011-01-04T09:31:30Z
Indexed on 2011/01/04 9:54 UTC
Is there a way (using XPath and PHP) to do the following (WITHOUT external XSLT files)?
- Remove all tables and their contents
- Remove everything after the first h1 tag
- Keep only paragraphs, INCLUDING their inner HTML (links, lists, etc.)
I received an XSLT answer here, but I'm looking for XPath queries that don't require external files.
Currently, I've got the HTML in question loaded into a SimpleXmlElement via:
$doc = new DOMDocument();
@$doc->loadHTML($xml); // loadHTML is an instance method; @ hides warnings from malformed markup
$data = simplexml_import_dom($doc);
Now I need help with:
$data = $data->xpath('??????');
I've been working on this for several days to no avail. I'd really appreciate any help.
Edit: I don't particularly care what's inside the paragraphs, as I can use strip_tags to eliminate what I don't want. All I need to do is to isolate the paragraphs from the rest of the source. I suppose a more specific, accurate requirement would be this:
Return only paragraphs (and their HTML contents) that aren't contained in tables and that appear before the first h1 tag.
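For illustration, the refined requirement can be sketched as a single XPath 1.0 expression. This is a minimal sketch, not a definitive answer. It assumes "before the first h1" means "no h1 occurs earlier in document order", and the sample markup in $xml is hypothetical, standing in for the real page source.

```php
<?php
// Hypothetical sample input; the real $xml would be the page source.
$xml = <<<HTML
<html><body>
<p>Intro with a <a href="#">link</a>.</p>
<table><tr><td><p>Inside a table</p></td></tr></table>
<p>Second paragraph.</p>
<h1>Title</h1>
<p>After the heading.</p>
</body></html>
HTML;

// Collect libxml parse warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($xml);
libxml_clear_errors();

$data = simplexml_import_dom($doc);

// Select p elements that have no table ancestor and no h1 earlier in document order.
$paragraphs = $data->xpath('//p[not(ancestor::table)][not(preceding::h1)]');

// asXML() serializes each matched element together with its inner markup.
$html = array_map(function ($p) {
    return $p->asXML();
}, $paragraphs);

print_r($html);
```

Because the two predicates filter the selection rather than modify the document, nothing has to be removed first: paragraphs inside tables and paragraphs after the first h1 simply never match. strip_tags can then be applied to each string in $html if the inner markup is not wanted.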
© Stack Overflow or respective owner