Command line tool to query HTML elements (linux)

Posted by ipsec on Super User
Published on 2012-11-18T12:01:45Z


I am looking for a (Linux) command-line tool to parse HTML files and extract some elements, ideally with an XPath-like syntax.

I have the following requirements:

  • It must be able to parse arbitrary HTML files (which may contain errors) in a robust manner
  • It must be able to extract text of elements and attributes
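As a point of comparison for both requirements, here is a minimal sketch using only Python's standard-library html.parser, which is lenient with malformed HTML; the class name LinkExtractor and the sample input are my own, not from any of the tools discussed:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute and text content of every <a> element."""
    def __init__(self):
        super().__init__()
        self.links = []        # list of (href, text) pairs
        self._in_a = False
        self._text = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self._text = []
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            self.links.append((self._href, "".join(self._text)))
            self._in_a = False

# Deliberately broken HTML: unclosed <p>, unquoted attribute value.
# html.parser accepts it without reporting errors.
parser = LinkExtractor()
parser.feed('<p>Unclosed paragraph <a href="foo">bar</a> <a href=baz>qux</a>')
print(parser.links)   # -> [('foo', 'bar'), ('baz', 'qux')]
```

This meets both requirements, but at the cost of writing a small script rather than a one-line XPath query.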

What I have tried so far:

xmlstarlet: would be perfect, but it mostly reports errors in the files (e.g. "entity not defined"); even xml fo (xmlstarlet's format command) or HTML Tidy does not help.

xmllint: the best I have found so far, but it is not able to extract attribute text. Something like //a/@href reports <a href="foo">, while what I need is just foo. string(//a/@href) works, but queries only the first match; data() is not supported, since libxml2 implements only XPath 1.0.
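To illustrate what data(//a/@href) would return but string(//a/@href) cannot, here is a stdlib-Python sketch that collects every href value rather than just the first; the class name HrefCollector and the sample input are assumptions of mine, not part of xmllint:

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href value of every <a> element, not just the first."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)

c = HrefCollector()
c.feed('<a href="foo">one</a> <a href="bar">two</a>')
print(c.hrefs)   # -> ['foo', 'bar']
```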

hxextract: works, but cannot extract attributes.

XQilla: would support XPath 2.0 and thus data(). It also supports xqilla:parse-html, but I have had no luck making this work.

Can you recommend another tool?

© Super User or respective owner
