Parse html and find data in the html

Posted by Dan.StackOverflow on Stack Overflow See other posts from Stack Overflow or by Dan.StackOverflow
Published on 2010-04-01T04:04:03Z Indexed on 2010/04/01 4:13 UTC
Read the original article Hit count: 1207

Filed under:

python

|

html5lib

|

lxml

|

parsing

|

xpath

Hi all. I am trying to use html5lib to parse an html page in to something I can query with xpath. html5lib has close to zero documentation and I've spent too much time trying to figure this problem out. Ultimate goal is to pull out the second row of a table:

<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>

so lets try it:

>>> doc = html5lib.parse('<html><table><tr><td>Header</td></tr><tr><td>Want This</td> </tr></table></html>', treebuilder='lxml')
>>> doc
<lxml.etree._ElementTree object at 0x1a1c290>

that looks good, lets see what else we have:

>>> root = doc.getroot()
>>> print(lxml.etree.tostring(root))
<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/><html:body><html:table><html:tbody><html:tr><html:td>Header</html:td></html:tr><html:tr><html:td>Want This</html:td></html:tr></html:tbody></html:table></html:body></html:html>

LOL WUT?

seriously. I was planning on using some xpath to get at the data I want, but that doesn't seem to work. So what can I do? I am willing to try different libraries and approaches.

© Stack Overflow or respective owner

Related posts about python

unmet dependencies in Ubuntu 12.04

as seen on Ask Ubuntu - Search for 'Ask Ubuntu'
I tried today to install a dvb-card on my Ubuntu 12.04 (Linux blauhai-linux 3.2.0-25-generic #40-Ubuntu SMP Wed May 23 20:30:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux ). The installation failed with an error. After that, i tried to install python (it was already installed but i got this error): linux:~$… >>> More
How can I get sikuli-ide to work?

as seen on Ask Ubuntu - Search for 'Ask Ubuntu'
I installed sikuli-ide with sudo apt-get install sikuli-ide Everything was fine until I tried to start it from the terminal. I typed sikuli-ide But the only response I got was [info] locale: en_US The application was not started, furthermore there is no desktop file and sikuli-ide does not… >>> More
Getting PATH right for python after MacPorts install

as seen on Super User - Search for 'Super User'
I can't import some python libraries (PIL, psycopg2) that I just installed with MacPorts. I looked through these forums, and tried to adjust my PATH variable in $HOME/.bash_profile in order to fix this but it did not work. I added the location of PIL and psycopg2 to PATH. I know that Terminal is… >>> More
call python with system() in R to run a python script emulating the python console

as seen on Stack Overflow - Search for 'Stack Overflow'
I want to pass a chunk of Python code to Python in R with something like system('python ...'), and I'm wondering if there is an easy way to emulate the python console in this case. For example, suppose the code is "print 'hello world'", how can I get the output like this in R? >>> print… >>> More
Python - Calling a non python program from python?

as seen on Stack Overflow - Search for 'Stack Overflow'
Hi, I am currently struggling to call a non python program from a python script. I have a ~1000 files that when passed through this C++ program will generate ~1000 outputs. Each output file must have a distinct name. The command I wish to run is of the form: program_name -input -output -o1 -o2… >>> More

Related posts about html5lib

Skip sanitization for videos in html5lib

as seen on Stack Overflow - Search for 'Stack Overflow'
I am using a wmd-editor in django, much like this one in which I am typing. I would like to allow the users to embed videos in it. For that I am using the Markdown video extension here. The problem is that I am also sanitizing user input using html5lib sanitization and it doesn't allow object tags… >>> More
Which revision of html5lib is stable?

as seen on Stack Overflow - Search for 'Stack Overflow'
html5lib notes that it's latest release (0.11) is somewhat old. Using the Python portion, I have recursion problems as noted in Issue 70 and Issue 59 but can't find a recent Mercurial revision that is stable. The latest tip is no good, I got the following error from python setup.py install: byte-compiling… >>> More
Need to parse HTML document for links-- use a library like html5lib or something else?

as seen on Stack Overflow - Search for 'Stack Overflow'
I'm a very newbie webpage builder, currently working on creating a website that needs to change link colours according to the destination page. The links will be sorted into different classes (e.g. good, bad, neutral) by certain user input criteria-- e.g. links with content the user would find of… >>> More
Parse html and find data in the html

as seen on Stack Overflow - Search for 'Stack Overflow'
Hi all. I am trying to use html5lib to parse an html page in to something I can query with xpath. html5lib has close to zero documentation and I've spent too much time trying to figure this problem out. Ultimate goal is to pull out the second row of a table: <html> <table> … >>> More
Parsing with BeautifulSoup, error message TypeError: coercing to Unicode: need string or buffer, NoneType found

as seen on Stack Overflow - Search for 'Stack Overflow'
so I'm trying to scrape an Amazon page for data, and I'm getting an error when I try to parse for where the seller is located. Here's my code: #getting the html request = urllib2.Request('http://www.amazon.com/gp/offer-listing/0393934241/') opener = urllib2.build_opener() #hiding that I'm a webscraper request… >>> More