Parsing HTML with XPath and PHP
- by Peter
Is there a way (using XPath and PHP) to do the following (WITHOUT external XSLT files)?
Remove all tables and their contents
Remove everything after the first h1 tag
Keep only paragraphs, INCLUDING their inner HTML (links, lists, etc.)
I received an XSLT answer here, but I'm looking for XPath queries that don't require external files.
Currently, I've got the HTML in question loaded into a SimpleXmlElement via:
$doc = new DOMDocument();
@$doc->loadHTML($xml); // loadHTML() is an instance method; @ hides warnings from malformed HTML
$data = simplexml_import_dom($doc);
Now I need help with:
$data = $data->xpath('??????');
I've been working on this for several days to no avail. I really appreciate the help.
Edit: I don't particularly care what's inside the paragraphs, as I can use strip_tags to eliminate what I don't want. All I need to do is to isolate the paragraphs from the rest of the source. I suppose a more specific, accurate requirement would be this:
Return only paragraphs (and their HTML contents) that aren't contained in tables and that appear before the first h1 tag.
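For illustration, here is a minimal sketch of one way that requirement could be written as a single XPath expression. It assumes DOMXPath is acceptable in place of SimpleXML, and that "before the first h1" means "no h1 precedes the paragraph in document order". The function name paragraphsBeforeFirstH1 and the sample markup are made up for this example.

```php
<?php
// Hypothetical helper: returns the outer HTML of every <p> that is not
// inside a <table> and that has no <h1> before it in document order.
function paragraphsBeforeFirstH1(string $html): array
{
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // not(ancestor::table): skip paragraphs nested anywhere inside a table.
    // not(preceding::h1):   skip paragraphs that come after any h1.
    $nodes = $xpath->query('//p[not(ancestor::table)][not(preceding::h1)]');

    $result = [];
    foreach ($nodes as $p) {
        // saveHTML($node) serializes the node together with its inner markup.
        $result[] = $doc->saveHTML($p);
    }
    return $result;
}

// Sample input, invented for this sketch.
$html = <<<HTML
<html><body>
<p>Intro with a <a href="#x">link</a>.</p>
<table><tr><td><p>Inside a table.</p></td></tr></table>
<p>Second paragraph.</p>
<h1>Heading</h1>
<p>After the heading.</p>
</body></html>
HTML;

$paragraphs = paragraphsBeforeFirstH1($html);
foreach ($paragraphs as $p) {
    echo $p, "\n";
}
```

The same expression should also work when passed to $data->xpath(...) on the SimpleXmlElement from the question, with asXML() used to serialize each returned element. Note that if the document contains no h1 at all, this expression keeps every paragraph outside a table.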