BeautifulSoup can't parse a webpage?
- by JLTChiu
I am using Beautiful Soup to parse a webpage. I've heard it's popular and works well, but it doesn't seem to work properly for me.
Here's what I did:
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen("http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1")
soup = BeautifulSoup(page)
print soup.prettify()
I think this is pretty straightforward: I open the webpage and pass the response to BeautifulSoup. But here's what I got:
Warning (from warnings module):
File "C:\Python27\lib\site-packages\bs4\builder\_htmlparser.py", line 149
"Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))
...
HTMLParseError: bad end tag: u'</"+"script>', at line 634, column 94
I thought the CNN website would be well designed, so I'm not sure what's going on. Does anyone have an idea about this?
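For reference, the warning above suggests installing an external parser (lxml or html5lib) and naming it when constructing the soup. Below is a minimal sketch of that suggestion, not a confirmed fix for this exact page. SAMPLE_HTML is a made-up stand-in for the CNN page (it only imitates the `</"+"script>` fragment from the error message), and pick_parser is a hypothetical helper that falls back to the built-in parser when neither lxml nor html5lib is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: a script that assembles a
# closing tag by string concatenation, like the fragment in the error.
SAMPLE_HTML = """<html><head><title>Sample</title></head>
<body>
<script>document.write("<scr"+"ipt src='x.js'></"+"script>");</script>
<p>Hello</p>
</body></html>"""


def pick_parser():
    """Return the name of the first external parser that can be imported."""
    for name in ("lxml", "html5lib"):
        try:
            __import__(name)
            return name
        except ImportError:
            continue
    # Assumption: fall back to the built-in parser if nothing else is installed.
    return "html.parser"


parser_name = pick_parser()

# Passing the parser name as the second argument tells Beautiful Soup
# which parser to use instead of letting it choose one itself.
soup = BeautifulSoup(SAMPLE_HTML, parser_name)

print(parser_name)
print(soup.title.string)
print(soup.p.get_text())
```

With the original code, the equivalent change would be BeautifulSoup(page, "lxml") or BeautifulSoup(page, "html5lib"), after installing the corresponding package.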