Scrape HTML tables from a given URL into CSV

Posted by dreeves on Stack Overflow, 2010-04-09

I seek a tool that can be run on the command line like so:

    tablescrape 'http://someURL.foo.com' [n]

If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list. If n is specified or if there's only one table, it should parse the table and spit it to stdout as CSV or TSV.
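
For concreteness, here is a minimal sketch of how such a tool might look in Python, using pandas.read_html to do the table parsing; the script name and the choice of pandas are my own assumptions, not part of the question:

    #!/usr/bin/env python3
    # tablescrape: hypothetical sketch of the tool described above.
    # Assumes pandas (and an HTML parser such as lxml) is installed.
    import sys

    import pandas as pd

    def main() -> None:
        url = sys.argv[1]
        n = int(sys.argv[2]) if len(sys.argv) > 2 else None

        tables = pd.read_html(url)  # one DataFrame per <table> on the page

        if n is None and len(tables) > 1:
            # Summarize each table: header row and total number of rows.
            for i, t in enumerate(tables, 1):
                print(f"{i}. {list(t.columns)} ({len(t)} rows)")
        else:
            # Parse the requested (or only) table and emit CSV on stdout.
            tables[(n or 1) - 1].to_csv(sys.stdout, index=False)

    if __name__ == "__main__":
        main()

Run with just the URL to get the numbered summary, or with a trailing index to get CSV for that one table.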

Potential additional features:

  • To be really fancy you could parse a table within a table, but for my purposes -- fetching data from Wikipedia pages and the like -- that's overkill. The Perl module HTML::TableExtract can do this and may be a good place to start for writing the tool I have in mind.
  • An option to asciify any Unicode.
  • An option to apply an arbitrary regex substitution for fixing weirdnesses in the parsed table (both of these options are sketched below).
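
As a rough sketch of those last two options (the helper names here are mine, purely for illustration): unicodedata.normalize can handle the asciifying, and re.sub the arbitrary cleanup:

    import re
    import unicodedata

    def asciify(text: str) -> str:
        # Decompose accented characters, then drop anything non-ASCII.
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")

    def fix_cell(text: str, pattern: str, replacement: str) -> str:
        # Apply a user-supplied regex substitution to one parsed cell.
        return re.sub(pattern, replacement, text)

    # E.g., strip Wikipedia-style footnote markers like "[3]":
    print(fix_cell(asciify("Ångström[3]"), r"\[\d+\]", ""))  # -> Angstrom

Both would be applied per cell after parsing, before the row is written out as CSV.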
