Parsing HTML with Python 2.7 - HTMLParser, SGMLParser, or Beautiful Soup?

Posted by Eric Wilson on Stack Overflow See other posts from Stack Overflow or by Eric Wilson
Published on 2011-06-27T14:11:55Z Indexed on 2013/11/11 3:54 UTC
Read the original article Hit count: 511

Filed under:

python

|

html

|

parsing

|

beautifulsoup

|

html-parsing

I want to do some screen-scraping with Python 2.7, and I have no context for the differences between HTMLParser, SGMLParser, or Beautiful Soup.

Are these all trying to solve the same problem, or do they exist for different reasons? Which is simplest, which is most robust, and which (if any) is the default choice?

Also, please let me know if I have overlooked a significant option.

Edit: I should mention that I'm not particularly experienced in HTML parsing, and I'm particularly interested in which will get me moving the quickest, with the goal of parsing HTML on one particular site.

© Stack Overflow or respective owner

Related posts about python

unmet dependencies in Ubuntu 12.04

as seen on Ask Ubuntu - Search for 'Ask Ubuntu'
I tried today to install a dvb-card on my Ubuntu 12.04 (Linux blauhai-linux 3.2.0-25-generic #40-Ubuntu SMP Wed May 23 20:30:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux ). The installation failed with an error. After that, i tried to install python (it was already installed but i got this error): linux:~$… >>> More
How can I get sikuli-ide to work?

as seen on Ask Ubuntu - Search for 'Ask Ubuntu'
I installed sikuli-ide with sudo apt-get install sikuli-ide Everything was fine until I tried to start it from the terminal. I typed sikuli-ide But the only response I got was [info] locale: en_US The application was not started, furthermore there is no desktop file and sikuli-ide does not… >>> More
Getting PATH right for python after MacPorts install

as seen on Super User - Search for 'Super User'
I can't import some python libraries (PIL, psycopg2) that I just installed with MacPorts. I looked through these forums, and tried to adjust my PATH variable in $HOME/.bash_profile in order to fix this but it did not work. I added the location of PIL and psycopg2 to PATH. I know that Terminal is… >>> More
call python with system() in R to run a python script emulating the python console

as seen on Stack Overflow - Search for 'Stack Overflow'
I want to pass a chunk of Python code to Python in R with something like system('python ...'), and I'm wondering if there is an easy way to emulate the python console in this case. For example, suppose the code is "print 'hello world'", how can I get the output like this in R? >>> print… >>> More
Python - Calling a non python program from python?

as seen on Stack Overflow - Search for 'Stack Overflow'
Hi, I am currently struggling to call a non python program from a python script. I have a ~1000 files that when passed through this C++ program will generate ~1000 outputs. Each output file must have a distinct name. The command I wish to run is of the form: program_name -input -output -o1 -o2… >>> More

Related posts about html

Install usblib package - Ubuntu

as seen on Ask Ubuntu - Search for 'Ask Ubuntu'
I need the package libusb for another package I am installing. I tried the following which seemed to install the package, sudo apt-get install libusb-dev but when I try to install the other package I get, configure: error: Package requirements (libusb-1.0 >= 0.9.1) were not met: No package… >>> More
Prevent malicious vulnerability scan increasing load on a server

as seen on Server Fault - Search for 'Server Fault'
Hi all, this week we have been suffering some malicious vulnerability scans to our servers, increasing the load on them, making them nearly unusable. The attack is easy to defend, just blocking the offending ip, but only after discovering it. Is there any form of prevent it? Is it normal that… >>> More
can't install psycopg2 in my env on mac os x lion

as seen on Server Fault - Search for 'Server Fault'
I tried install psycopg2 via pip in my virtual env, but got this error: ld: library not found for -lpq (full log here: http://pastebin.com/XdmGyJ4u ) I tried install postgres 9.1 from .dmg and via port, (gksks)iMac-Alexander:~ lorddaedra$ locate libpq /Developer/SDKs/MacOSX10.7.sdk/usr/include/libpq /Developer/SDKs/MacOSX10… >>> More
Bitnami redmine error SVN

as seen on Server Fault - Search for 'Server Fault'
I'm installing the Bitnami Redmine stack (redmine + subversion). Firstly I install configure and test it locally (Ubuntu 14.04 LTS). And everything is OK. I install Bitnami stack on server (Red Hat 4.4.7-4) and configure SVN. I commit files into SVN and connect project into Redmine with SVN repository… >>> More
Can the .htaccess file slow down a website to a crawl? If so, are there better ways to solve these problems with different rewrite rules and such?

as seen on Pro Webmasters - Search for 'Pro Webmasters'
here is my htaccess file...... RewriteCond %{REQUEST_URI} ^/patients/billing/FAQ_billing\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/billing/getintouch\.html$ RewriteRule ^patients/billing/(.*)\.html$ $1.php [L,NC] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/a\.html$ [OR] RewriteCond… >>> More