Extract a specific element from nested elements using lxml.html

Posted by Dan.StackOverflow on Stack Overflow
Published on 2010-04-14T04:40:08Z

Hi all, I am having some problems that I think come down to my XPath expression. I am using the html module from the lxml package to extract some data. Below is the most simplified version of the situation; keep in mind that the HTML I am actually working with is much uglier.

<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr> 
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>

What I really want is the most deeply nested table, because it contains the header text "Header1". I am trying this:

from lxml import html
page = '...'
tree = html.fromstring(page)
print(tree.xpath('//table[//*[contains(text(), "Header1")]]'))

but that gives me all of the table elements. I just want the one table that contains this text. I understand what is going on, but I am having a hard time figuring out how to do this short of breaking out some nasty regex. Any thoughts?
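
For reference, here is a minimal sketch of what seems to be happening and one possible way around it, using the simplified markup from above as the page (the variable names are only illustrative). The `//` inside the predicate starts from the document root, so the condition holds for every table. Switching to `.//` makes it relative, but the outer tables still match because the text is nested inside them too. One option, assuming the header text sits in a single element as it does here, is to find that element first and then step up to its nearest table ancestor:

```python
from lxml import html

# The simplified markup from the question, used as the page content.
page = """
<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr>
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>
"""

tree = html.fromstring(page)

# Original attempt: '//' in the predicate searches the whole document,
# so the predicate is true for every table.
all_tables = tree.xpath('//table[//*[contains(text(), "Header1")]]')

# Relative predicate: true only for tables that contain the text somewhere
# below them, which still includes the outer (ancestor) tables.
containing = tree.xpath('//table[.//*[contains(text(), "Header1")]]')

# Find the element holding the text, then take its nearest table ancestor.
innermost = tree.xpath('//*[contains(text(), "Header1")]/ancestor::table[1]')

print(len(all_tables), len(containing), len(innermost))
```

On this sample markup the first two queries match all three tables, while the last one returns only the innermost table. Whether this holds up on the much uglier real HTML is untested; in particular, `contains(text(), ...)` only looks at the first text node of each element.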

© Stack Overflow or respective owner
