Command line tool to query HTML elements (Linux)
Posted by ipsec on Super User
Published on 2012-11-18T12:01:45Z
Indexed on 2012/11/18 17:06 UTC
I am looking for a (Linux) command line tool to parse HTML files and extract some elements, ideally with an XPath-like syntax.
I have the following requirements:
- It must be able to parse arbitrary HTML files (which may contain errors) in a robust manner
- It must be able to extract text of elements and attributes
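To make the two requirements concrete, here is a minimal sketch of the behaviour I am after. It uses Python's standard-library html.parser instead of a dedicated command line tool, and it matches by tag name only, with no XPath support. The names ElementQuery and query are made up for illustration, and the sample input is my own.

```python
from html.parser import HTMLParser


class ElementQuery(HTMLParser):
    """Collect one attribute and the text of every element with a given tag.

    Hypothetical helper for illustration; it matches on tag name only.
    """

    def __init__(self, tag, attr):
        super().__init__(convert_charrefs=True)
        self.tag = tag
        self.attr = attr
        self.values = []    # attribute values, one per matching start tag
        self.texts = []     # text content, one per properly closed element
        self._depth = 0     # greater than 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self._depth == 0:
            self._buffer = []
        self._depth += 1
        value = dict(attrs).get(self.attr)
        if value is not None:
            self.values.append(value)

    def handle_endtag(self, tag):
        if tag == self.tag and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.texts.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def query(html, tag, attr):
    """Return (attribute values, element texts) for all <tag> elements."""
    parser = ElementQuery(tag, attr)
    parser.feed(html)
    parser.close()
    return parser.values, parser.texts


if __name__ == "__main__":
    # Deliberately sloppy input: unclosed tags, an undefined entity,
    # and an unquoted attribute value.
    sample = '<p>unclosed <a href="foo">first</a> &undefined; <a href=bar>second</a><b>'
    values, texts = query(sample, "a", "href")
    print(values)
    print(texts)
```

The point is that all matches are returned, not just the first, and that the parser does not give up on undefined entities or unclosed tags. An element that is never closed contributes its attribute but no text in this sketch.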
What I have tried so far:
- xmlstarlet: would be perfect, but it mostly reports errors in the files (e.g. entity not defined); even running xml fo or htmltidy first does not help.
- xmllint: the best I have found so far, but it is not able to extract attribute text. Something like //a/@href reports <a href="foo">, whereas what I need is just foo. string(//a/@href) works, but returns only the first match. The data function is not supported.
- hxextract: works, but cannot extract attributes.
- XQilla: would support XPath 2.0 and thus the data function. It also supports xqilla:parse-html, but I have had no luck making this work.
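As a stopgap for the xmllint case, the serialized attributes could be post-processed. The following is only a sketch: the function name attribute_values is made up, and it assumes the tool prints attributes in the usual name="value" form, either on their own or inside a start tag. It is a regular expression over the output, not a real parser.

```python
import html
import re

# Matches name="value" or name='value'; the value ends up in group 1 or 2.
ATTR_VALUE = re.compile(r'''\b[\w:-]+\s*=\s*(?:"([^"]*)"|'([^']*)')''')


def attribute_values(serialized):
    """Extract every quoted attribute value from serialized markup.

    Hypothetical helper: it assumes quoted values and decodes character
    references such as &amp; in the result.
    """
    values = []
    for match in ATTR_VALUE.finditer(serialized):
        raw = match.group(1) if match.group(1) is not None else match.group(2)
        values.append(html.unescape(raw))
    return values


if __name__ == "__main__":
    print(attribute_values('<a href="foo">'))
    print(attribute_values(' href="foo" href="bar"'))
```

This returns every value rather than only the first, but it feels fragile, which is why I would prefer a tool that does this natively.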
Can you recommend another tool?
© Super User or respective owner