Url open encoding

Posted by badc0re on Stack Overflow
Published on 2012-06-28T08:56:53Z Indexed on 2012/06/28 9:16 UTC

I have the following code for urllib and BeautifulSoup:

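import urllib
from bs4 import BeautifulSoup

defLinks = []  # collected href values
getSite = urllib.urlopen(pageName)  # open the current site (pageName holds its URL)
getSitesoup = BeautifulSoup(getSite.read())  # parse the site content
print getSitesoup.original_encoding  # bs4 names this attribute original_encoding
for value in getSitesoup.find_all('link'):  # iterate over all <link> tags
    defLinks.append(value.get('href'))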

Running it produces this warning:

/usr/lib/python2.6/site-packages/bs4/dammit.py:231: UnicodeWarning: Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
  "Some characters could not be decoded, and were "

And when I try to read the site, I get:

?7?e????0*"I??G?H????F??????9-??????;??E?YÞBs????????????4i???)?????^W?????`w?Ke??%??*9?.'OQB???V??@?????]???(P??^??q?$?S5???tT*?Z
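
Output like this often means the response body is still compressed rather than badly encoded. That is only a guess from the look of the bytes, and nothing in the post confirms it. A minimal sketch for checking, assuming the server sent gzip data; `maybe_gunzip` is a hypothetical helper name, not part of urllib or BeautifulSoup:

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(raw):
    """Return raw unchanged unless it starts with the gzip magic bytes."""
    if raw[:2] == GZIP_MAGIC:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        return zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    return raw
```

If that assumption holds, passing `maybe_gunzip(getSite.read())` to BeautifulSoup instead of the raw body should give it readable markup. If the body does not start with the gzip magic bytes, the helper returns it untouched and the cause lies elsewhere.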

© Stack Overflow or respective owner