How to get HTML elements with Python lxml
Posted by Damiano on Stack Overflow
Published on 2010-05-10T23:50:03Z
Indexed on 2010/05/10 23:54 UTC
Hello!
I have this HTML code:
<table>
<tr>
<td class="test"><b><a href="">aaa</a></b></td>
<td class="test">bbb</td>
<td class="test">ccc</td>
<td class="test"><small>ddd</small></td>
</tr>
<tr>
<td class="test"><b><a href="">eee</a></b></td>
<td class="test">fff</td>
<td class="test">ggg</td>
<td class="test"><small>hhh</small></td>
</tr>
</table>
I use this Python code with the lxml module to extract all <td class="test"> elements:
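import urllib2  # Python 2; on Python 3 use urllib.request instead
import lxml.html

# Download the page and parse it into an element tree
code = urllib2.urlopen("http://www.example.com/page.html").read()
html = lxml.html.fromstring(code)
# Keep the first and fourth <td class="test"> of each row
result = html.xpath('//td[@class="test"][position() = 1 or position() = 4]')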
It works well! The result is:
<td class="test"><b><a href="">aaa</a></b></td>
<td class="test"><small>ddd</small></td>
<td class="test"><b><a href="">eee</a></b></td>
<td class="test"><small>hhh</small></td>
(so the first and the fourth column of each <tr>)
Now, I have to extract:
aaa (the link text)
ddd (the text inside the <small> tag)
eee (the link text)
hhh (the text inside the <small> tag)
How could I extract these values?
(The problem is that I have to strip the <b> tag and get the link text in the first column, and strip the <small> tag in the fourth column.)
Thank you!
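One possible approach, sketched here rather than taken from the original post: lxml.html elements have a text_content() method that returns all the text inside an element with the tags dropped, so the <b>, <a> and <small> wrappers do not need to be handled one by one. The PAGE string below is an inline copy of the table from the question (an assumption, so the sketch runs without fetching a URL), and the names values and rows are only illustrative.

```python
import lxml.html

# Inline copy of the table from the question (assumption: stands in for the
# downloaded page so the example needs no network access).
PAGE = """
<table>
<tr>
<td class="test"><b><a href="">aaa</a></b></td>
<td class="test">bbb</td>
<td class="test">ccc</td>
<td class="test"><small>ddd</small></td>
</tr>
<tr>
<td class="test"><b><a href="">eee</a></b></td>
<td class="test">fff</td>
<td class="test">ggg</td>
<td class="test"><small>hhh</small></td>
</tr>
</table>
"""

doc = lxml.html.fromstring(PAGE)

# Same XPath as in the question: first and fourth <td class="test"> of each row.
cells = doc.xpath('//td[@class="test"][position() = 1 or position() = 4]')

# text_content() joins every text node under the element, ignoring the markup.
values = [cell.text_content().strip() for cell in cells]
print(values)  # ['aaa', 'ddd', 'eee', 'hhh']

# Variant: keep the two columns paired per row, using the XPath string()
# function, which returns the text of the first matching node.
rows = [
    (tr.xpath('string(td[@class="test"][1])'),
     tr.xpath('string(td[@class="test"][4])'))
    for tr in doc.xpath('//tr')
]
print(rows)
```

The first form gives a flat list in document order. The per-row variant may be more convenient if the link text and the <small> text need to stay together for each <tr>.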
© Stack Overflow or respective owner