Trying to grab just absolute links from a webpage using BeautifulSoup
Posted by Kevin on Stack Overflow
Published on 2010-03-23T17:22:53Z
Indexed on 2010/03/23 18:23 UTC
python | beautifulsoup
I am reading the contents of a webpage using BeautifulSoup. I want to grab only the <a href> links that start with http://. I know that BeautifulSoup lets you search by attributes, so I guess I am just having a syntax issue. I would imagine it would go something like this:
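import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch the page and parse it
page = urllib2.urlopen("http://www.linkpages.com")
soup = BeautifulSoup(page)
# Print every <a> tag whose href starts with http://
for link in soup.findAll('a'):
    if link['href'].startswith('http://'):
        print link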
But that returns:
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python26\lib\BeautifulSoup.py", line 598, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'
Any ideas? Thanks in advance.
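The last traceback frame, self._getAttrMap()[key], is a plain dictionary lookup, so the error suggests that at least one <a> tag on the page has no href attribute at all (a named anchor such as <a name="top">, for instance). Below is a minimal sketch of that failure mode. It uses ordinary dicts as stand-ins for tag attributes; the sample values and the helper names href_or_none and is_absolute are made up for illustration and are not part of BeautifulSoup.

```python
# Stand-ins for the attribute maps of two <a> tags (hypothetical sample data).
anchors = [
    {"href": "http://www.example.com/"},  # ordinary link
    {"name": "top"},                      # named anchor: no href at all
]


def href_or_none(attrs):
    """Return the href value, or None when the tag has no href."""
    # dict.get never raises KeyError; attrs["href"] would for the second entry.
    return attrs.get("href")


def is_absolute(attrs):
    """True only when an href exists and starts with http://."""
    href = href_or_none(attrs)
    return href is not None and href.startswith("http://")


results = [is_absolute(a) for a in anchors]
print(results)
```

Indexing the second entry directly, as in anchors[1]["href"], raises the same KeyError: 'href' seen above, while the .get lookup simply returns None.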
EDIT
This isn't for any site in particular; the script gets the URL from the user. So internal link targets would be an issue, which is also why I only want the <a> tags from the pages. If I point it at www.reddit.com, it parses the first few links and then gets to this:
<a href="http://www.reddit.com/top/">top</a>
<a href="http://www.reddit.com/saved/">saved</a>
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python26\lib\BeautifulSoup.py", line 598, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'
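The loop stops partway through the page, which is consistent with it reaching an <a> tag that has no href. One way to sketch the intended filter is to skip such tags before testing the prefix. The example below is an assumption-laden stand-in: it uses Python 3's standard-library html.parser rather than the BeautifulSoup 3 / Python 2.6 setup in the question, and the names AbsoluteLinkParser, absolute_links, and the sample markup are hypothetical. In BeautifulSoup itself the equivalent guard would be link.get('href') in place of link['href'], or restricting the search with soup.findAll('a', href=True).

```python
from html.parser import HTMLParser


class AbsoluteLinkParser(HTMLParser):
    """Collect href values of <a> tags that start with a given prefix."""

    def __init__(self, prefixes=("http://",)):
        super().__init__()
        self.prefixes = tuple(prefixes)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; dict() makes lookup easy.
        href = dict(attrs).get("href")
        # href is None when the attribute is missing, so guard before
        # calling startswith (this is the case that raised KeyError).
        if href and href.startswith(self.prefixes):
            self.links.append(href)


def absolute_links(html, prefixes=("http://",)):
    """Return the hrefs in html that begin with one of the prefixes."""
    parser = AbsoluteLinkParser(prefixes)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup: two absolute links, a named anchor, a relative link.
sample = (
    '<a href="http://www.reddit.com/top/">top</a>'
    '<a name="content"></a>'
    '<a href="/relative/path">relative</a>'
    '<a href="http://www.reddit.com/saved/">saved</a>'
)
print(absolute_links(sample))
```

Passing prefixes=("http://", "https://") would also keep secure links, which the original prefix test leaves out.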
© Stack Overflow or respective owner