Getting BeautifulSoup to find a specific <p>
Posted by Ryan on Stack Overflow
Published on 2010-03-26T06:32:42Z
I'm trying to put together a basic HTML scraper for a variety of scientific journal websites, specifically trying to get the abstract or introductory paragraph.
The current journal I'm working on is Nature, and the article I've been using as my sample can be seen at http://www.nature.com/nature/journal/v463/n7284/abs/nature08715.html.
I can't get the abstract out of that page, however. I'm searching for everything between the <p class="lead">...</p> tags, but I can't figure out how to isolate that element. I thought it would be something simple like:
from BeautifulSoup import BeautifulSoup
import re
import urllib2
# Download the article page
address = "http://www.nature.com/nature/journal/v463/n7284/full/nature08715.html"
html = urllib2.urlopen(address).read()
# Parse it and look for the first <p class="lead"> element
soup = BeautifulSoup(html)
abstract = soup.find('p', attrs={'class': 'lead'})
print abstract
With Python 2.5 and BeautifulSoup 3.0.8, running this prints None. I can't use anything else that needs to be compiled or installed (like lxml). Is BeautifulSoup confused, or am I?
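One way to narrow this down without installing anything might be to check first whether the string class="lead" appears in the downloaded html at all, and then to try a stdlib-only parser as a cross-check. The sketch below is only an illustration under assumptions: LeadParagraphParser and extract_lead_paragraphs are hypothetical names, not part of BeautifulSoup, and it has not been run against the Nature page or on Python 2.5. It is written against the standard HTMLParser module, with an import fallback for Python 3.

```python
# Stdlib-only sketch: collect the text of every <p class="lead"> element.
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class LeadParagraphParser(HTMLParser):
    """Hypothetical helper that gathers text inside <p class="lead"> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_lead = False   # True while inside a matching <p>
        self.chunks = []       # text pieces of the paragraph being read
        self.paragraphs = []   # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'p' and 'lead' in classes:
            self.in_lead = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == 'p' and self.in_lead:
            self.paragraphs.append(''.join(self.chunks).strip())
            self.in_lead = False

    def handle_data(self, data):
        # Text inside nested inline tags (<b>, <i>, ...) is kept as well
        if self.in_lead:
            self.chunks.append(data)


def extract_lead_paragraphs(html_text):
    """Return the text of each <p class="lead"> found in html_text."""
    parser = LeadParagraphParser()
    parser.feed(html_text)
    parser.close()
    return parser.paragraphs
```

With the page source in html, extract_lead_paragraphs(html) would return a list of matching paragraphs, or an empty list if none is found. If the raw html does contain class="lead" but soup.find still returns None, that would point at the parsing step rather than the search. The sketch relies on a closing </p> being present, and on Python 2 it drops character references such as &amp; instead of decoding them.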
© Stack Overflow or respective owner