BeautifulSoup Parser Confusion - HTML

Posted by lyngbym on Stack Overflow
Published on 2011-01-08T20:50:24Z

Filed under: beautifulsoup

I'm trying to scrape some content from another site, and I'm not sure why BeautifulSoup is producing this output. It finds only a blank space inside the matched div, but the real HTML contains a large amount of markup. I apologize if this is a mistake on my part; I'm new to Python.

Here's my code:

import mechanize
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

def scrape_trails(BASE_URL, data):
    # Parse the page and look up the div that should hold the trail names
    soup = BeautifulSoup(data)
    sitesDiv = soup.findAll("div", attrs={"id" : "sitesDiv"})
    # This prints a div containing only whitespace, not the expected markup
    print sitesDiv
    return sitesDiv


def main():
    BASE_URL = "http://www.dnr.state.mn.us/skiing/skipass/list.html"
    # Fetch the raw HTML with mechanize
    br = mechanize.Browser()
    data = br.open(BASE_URL).get_data()
    links = scrape_trails(BASE_URL, data)


if __name__ == '__main__':
    main()

If you follow that URL, you can see that sitesDiv contains a lot of markup. I'm not sure whether I'm doing something wrong or whether this is just malformed markup that the parser can't handle. Thanks!
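
One way to narrow this down is to check, independently of BeautifulSoup, whether the raw response actually carries markup inside sitesDiv. The following is a minimal sketch under assumptions, not a fix: it targets Python 3 and uses only the standard library's html.parser, whereas the code above is Python 2 with BeautifulSoup 3. The names DivProbe and probe are hypothetical helpers introduced for illustration, and the sample HTML is made up rather than taken from the real page. On badly malformed markup the depth tracking may drift, so treat the counts as a rough indication only.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DivProbe(HTMLParser):
    """Count tags and text characters found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0        # 0 means we are outside the target element
        self.tag_count = 0    # start tags seen inside the target
        self.text_chars = 0   # non-whitespace-trimmed text length inside the target

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            # Enter the target element when its id attribute matches.
            if dict(attrs).get("id") == self.target_id and tag not in VOID_TAGS:
                self.depth = 1
            return
        self.tag_count += 1
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.text_chars += len(data.strip())


def probe(html, target_id):
    """Return (tag_count, text_chars) for the element whose id is target_id."""
    parser = DivProbe(target_id)
    parser.feed(html)
    parser.close()
    return parser.tag_count, parser.text_chars


if __name__ == "__main__":
    # Made-up sample standing in for the downloaded page.
    sample = ('<div id="sitesDiv"><ul><li><a href="/a">Trail A</a></li>'
              '<li>Trail B<br></li></ul></div><p>outside</p>')
    print(probe(sample, "sitesDiv"))
```

If this reports zero tags for the downloaded page, the server probably did not send the markup in the initial response (for example, because a script fills the div in later). If it reports many tags while BeautifulSoup still shows an empty div, the parser's handling of the markup is the more likely culprit.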
