Scraping text from multiple HTML files into a single CSV file

Posted by Lulu on Stack Overflow
Published on 2011-01-11T14:00:41Z

Filed under: html-parsing | beautifulsoup

I have just over 1500 HTML pages (1.html to 1500.html). I have written a script using Beautiful Soup that extracts most of the data I need, but it misses some of the data within the table.

My Input: e.g. file 1500.html

My Code:

#!/usr/bin/env python
import glob
import codecs
from BeautifulSoup import BeautifulSoup

# One output file shared by all input pages
with codecs.open('dump2.csv', "w", encoding="utf-8") as csvfile:
    for file in glob.glob('*html*'):
        print 'Processing', file
        soup = BeautifulSoup(open(file).read())
        rows = soup.findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            #print >> csvfile,"#".join(col.string for col in cols)
            #print >> csvfile,"#".join(td.find(text=True))
            for col in cols:
                print >> csvfile, col.string
            # Separator after each table row
            print >> csvfile, "==="
        # Separator after each file
        print >> csvfile, "***"
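
One possible explanation, assuming the skipped cells contain nested markup (the input HTML is not shown, so this is a guess): in Beautiful Soup, `.string` returns None when a tag has more than one child node, for example text mixed with `<b>` or `<br>` tags, so such cells lose their text. The sketch below shows the idea of collecting every text node inside a cell instead. It deliberately swaps in the Python 3 standard library's `html.parser` in place of Beautiful Soup so that it runs without extra packages. The class `TableTextParser`, the helper `table_rows`, and the sample HTML are invented for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the full text of every <td>, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def _end_cell(self):
        if self._cell is not None and self._row is not None:
            # Join fragments with a space, then collapse runs of whitespace.
            self._row.append(" ".join(" ".join(self._cell).split()))
        self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()          # tolerate a missing </tr>
            self._row = []
        elif tag == "td" and self._row is not None:
            self._end_cell()         # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        # Called for every text node, including those inside nested tags.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td":
            self._end_cell()
        elif tag == "tr":
            self._end_row()


def table_rows(html):
    """Return a list of rows, each a list of cell texts."""
    parser = TableTextParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented sample: the first cell mixes nested tags and text.
sample = (
    "<table>"
    "<tr><td><b>Address1</b><br>12 High Street</td><td>Leeds</td></tr>"
    "<tr><td>Plain cell</td><td><span>Nested</span> text</td></tr>"
    "</table>"
)
rows = table_rows(sample)
```

Joining fragments with a space keeps "Address1" and the text after the `<br>` apart, at the cost of inserting a space where a tag splits a single word.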

Output:

One CSV file, with 1500 lines of text and columns of data. For some reason my code does not pull out all the required data: for example, the Address1 and Address 2 data at the start of the table do not come out. I modified the code to put in *** and === separators, and I then use Perl to turn the result into a clean CSV file. Unfortunately, I'm not sure how to change my code to get all the data I'm looking for!
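
The separator-and-Perl step might be avoidable, since Python's standard `csv` module can write one CSV line per table row and takes care of quoting. This is a minimal Python 3 sketch, assuming the cell texts have already been extracted into lists of strings. The function name `write_rows` and the sample rows are invented for illustration.

```python
import csv
import io


def write_rows(rows, out):
    """Write each row (a list of cell strings) as one CSV line."""
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # A cell whose text could not be read (None) becomes an empty field.
        writer.writerow(["" if cell is None else cell for cell in row])


# Invented sample rows; a comma inside a cell is quoted automatically.
sample_rows = [["Address1", "12 High Street, Leeds"], ["Address 2", None]]
buffer = io.StringIO()
write_rows(sample_rows, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead of a string buffer, the same function could be given a file opened with `open('dump2.csv', 'w', newline='', encoding='utf-8')`.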

© Stack Overflow or respective owner
