Scrape HTML tables from a given URL into CSV

Posted by dreeves on Stack Overflow, 2010-04-09

I seek a tool that can be run on the command line like so:

    tablescrape 'http://someURL.foo.com' [n]

If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list. If n is specified or if there's only one table, it should parse the table and spit it to stdout as CSV or TSV.
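
For concreteness, here is a minimal sketch of how such a tool might look in Python, using pandas.read_html to do the table parsing; the script name and the choice of pandas are my own assumptions, not part of the question:

    #!/usr/bin/env python3
    # tablescrape: hypothetical sketch of the tool described above.
    # Assumes pandas (and an HTML parser such as lxml) is installed.
    import sys

    import pandas as pd

    def main() -> None:
        url = sys.argv[1]
        n = int(sys.argv[2]) if len(sys.argv) > 2 else None

        tables = pd.read_html(url)  # one DataFrame per <table> on the page

        if n is None and len(tables) > 1:
            # Summarize each table: header row and total number of rows.
            for i, t in enumerate(tables, 1):
                print(f"{i}. {list(t.columns)} ({len(t)} rows)")
        else:
            # Parse the requested (or only) table and emit CSV on stdout.
            tables[(n or 1) - 1].to_csv(sys.stdout, index=False)

    if __name__ == "__main__":
        main()

Run with just the URL to get the numbered summary, or with a trailing index to get CSV for that one table.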

Potential additional features:

  • To be really fancy you could parse a table within a table, but for my purposes -- fetching data from Wikipedia pages and the like -- that's overkill. The Perl module HTML::TableExtract can do this and may be a good place to start for writing the tool I have in mind.
  • An option to asciify any Unicode.
  • An option to apply an arbitrary regex substitution for fixing weirdnesses in the parsed table (both of these options are sketched below).
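
As a rough sketch of those last two options (the helper names here are mine, purely for illustration): unicodedata.normalize can handle the asciifying, and re.sub the arbitrary cleanup:

    import re
    import unicodedata

    def asciify(text: str) -> str:
        # Decompose accented characters, then drop anything non-ASCII.
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")

    def fix_cell(text: str, pattern: str, replacement: str) -> str:
        # Apply a user-supplied regex substitution to one parsed cell.
        return re.sub(pattern, replacement, text)

    # E.g., strip Wikipedia-style footnote markers like "[3]":
    print(fix_cell(asciify("Ångström[3]"), r"\[\d+\]", ""))  # -> Angstrom

Both would be applied per cell after parsing, before the row is written out as CSV.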
