Command-line tool to query HTML elements (Linux)
- by ipsec
I am looking for a (Linux) command-line tool to parse HTML files and extract some elements, ideally with an XPath-like syntax.
I have the following requirements:
It must be able to parse arbitrary HTML files (which may contain errors) in a robust manner
It must be able to extract text of elements and attributes
What I have tried so far:
xmlstarlet: would be perfect, but it mostly reports errors on real-world files (e.g. "entity not defined"); even preprocessing with xml fo or HTML Tidy does not help.
xmllint: the best I have found so far, but it cannot extract attribute text. A query like //a/@href reports <a href="foo">, while what I need is just foo. string(//a/@href) works, but it returns only the first match; data() is not supported, since xmllint implements only XPath 1.0.
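For reference, the behaviour I am after can be sketched in a few lines of Python using only the standard library's tolerant html.parser (this is just an illustration of the requirement, not one of the tools above; the class name is mine):

```python
from html.parser import HTMLParser

class HrefExtractor(HTMLParser):
    """Collect the href attribute value of every <a> element."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)

# Broken HTML (unclosed tags) parses without any error being raised,
# and only the attribute values are reported -- not the whole node.
parser = HrefExtractor()
parser.feed('<p>see <a href="foo">link one<a href="bar">two')
print("\n".join(parser.hrefs))  # prints "foo" and "bar", one per line
```

This shows both requirements at once: the parser does not reject malformed input, and the output is the bare attribute text rather than the serialized node.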
hxextract: works, but cannot extract attributes.
XQilla: supports XPath 2.0 and thus data(). It also supports xqilla:parse-html, but I have had no luck getting this to work.
Can you recommend another tool?