extract specific element from nested elements using lxml html
Posted
by Dan.StackOverflow
on Stack Overflow
Published on 2010-04-14T04:40:08Z
Indexed on
2010/04/14
4:43 UTC
Hi all, I am having a problem that I think comes down to my XPath. I am using the html module from the lxml package to get at some data. Below is the most simplified version of the situation, but keep in mind that the HTML I am actually working with is much uglier.
<table>
<tr>
<td>
<table>
<tr><td></td></tr>
<tr><td>
<table>
<tr><td><u><b>Header1</b></u></td></tr>
<tr><td>Data</td></tr>
</table>
</td></tr>
</table>
</td></tr>
</table>
What I really want is the deeply nested table, because it has the header text "Header1". I am trying like so:
from lxml import html
page = '...'
tree = html.fromstring(page)
print(tree.xpath('//table[//*[contains(text(), "Header1")]]'))
but that gives me all of the table elements. I just want the one table that contains this text. I understand what is going on (the // inside the predicate searches the whole document again, so the condition is true for every table), but I am having a hard time figuring out how to do this short of breaking out some nasty regex. Any thoughts?
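One possible approach, sketched here against the simplified markup from the question only: instead of filtering tables with a predicate, locate the element that holds the header text and step up to its nearest table ancestor with ancestor::table[1]. The helper name find_header_table and the PAGE constant are made up for illustration, and the real, messier page may need a looser text match.

```python
from lxml import html

# Sample markup copied from the question; the real page is assumed to be messier.
PAGE = """
<table>
<tr>
<td>
<table>
<tr><td></td></tr>
<tr><td>
<table>
<tr><td><u><b>Header1</b></u></td></tr>
<tr><td>Data</td></tr>
</table>
</td></tr>
</table>
</td></tr>
</table>
"""


def find_header_table(tree, header_text):
    """Return the innermost <table> element(s) enclosing the header text.

    Hypothetical helper, not part of lxml.
    """
    # Match any element whose first text node contains the header text,
    # then walk up to its closest <table> ancestor. On the reverse
    # ancestor axis, [1] means "nearest", so outer tables are skipped.
    query = '//*[contains(text(), $header)]/ancestor::table[1]'
    # lxml lets XPath variables be passed as keyword arguments.
    return tree.xpath(query, header=header_text)


tree = html.fromstring(PAGE)
tables = find_header_table(tree, "Header1")

print(len(tables))  # 1
print(html.tostring(tables[0]))
```

Note that simply changing the inner // to .// would make the predicate relative, but it would still match all three tables here, since each outer table also has the header somewhere among its descendants. Going from the text up to the closest table avoids that.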
© Stack Overflow or respective owner