Scrape HTML tables from a given URL into CSV
- by dreeves
I seek a tool that can be run on the command line like so:
    tablescrape 'http://someURL.foo.com' [n]
If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list.
If n is specified or if there's only one table, it should parse the table and spit it to stdout as CSV or TSV.
Potential additional features:
To be really fancy you could parse a table within a table, but for my purposes -- fetching data from Wikipedia pages and the like -- that's overkill. The Perl module HTML::TableExtract can do this and may be a good place to start for writing the tool I have in mind; see the sketch after this list.
An option to asciify any Unicode.
An option to apply an arbitrary regex substitution for fixing weirdnesses in the parsed table.
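Here's a rough sketch of how the tool might go, assuming HTML::TableExtract for the parsing, LWP::Simple for fetching the page, and Text::CSV for the quoting. The tablescrape name and the overall shape are just my proposal, not an existing tool:

    #!/usr/bin/env perl
    # Sketch of the proposed tablescrape: summarize tables, or dump table n as CSV.
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use HTML::TableExtract;
    use Text::CSV;

    my ($url, $n) = @ARGV;
    die "usage: tablescrape URL [n]\n" unless defined $url;

    my $html = get($url);
    die "couldn't fetch $url\n" unless defined $html;

    my $te = HTML::TableExtract->new;   # no constraints: grab every table
    $te->parse($html);
    my @tables = $te->tables;
    die "no tables found at $url\n" unless @tables;

    if (!defined $n && @tables > 1) {
        # No index given and several tables: numbered summary of each.
        my $i = 1;
        for my $ts (@tables) {
            my @rows   = $ts->rows;
            my $header = join ' | ', map { defined $_ ? $_ : '' } @{ $rows[0] };
            printf "%d. %s (%d rows)\n", $i++, $header, scalar @rows;
        }
    } else {
        # A specific table was requested, or there's only one: emit it as CSV.
        my $ts = $tables[ (defined $n ? $n : 1) - 1 ]
            or die "no table number $n on that page\n";
        my $csv = Text::CSV->new({ binary => 1, eol => "\n" });
        for my $row ($ts->rows) {
            $csv->print( \*STDOUT, [ map { defined $_ ? $_ : '' } @$row ] );
        }
    }

For the asciify option, Text::Unidecode's unidecode() would be a natural per-cell fit, and the arbitrary regex substitution could likewise be applied to each cell just before the CSV row is printed.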
Related questions:
http://stackoverflow.com/questions/259091/how-can-i-scrape-an-html-table-to-csv
http://stackoverflow.com/questions/1403087/how-can-i-convert-an-html-table-to-csv
http://stackoverflow.com/questions/2861/options-for-html-scraping