Scrape HTML tables from a given URL into CSV
Posted by dreeves on Stack Overflow, 2010-04-09.
I seek a tool that can be run on the command line like so:
tablescrape 'http://someURL.foo.com' [n]
If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list.
If n is specified, or if there's only one table, it should parse the table and spit it to stdout as CSV or TSV.
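For reference, here is a minimal sketch of that behavior in Perl, assuming the LWP::Simple and HTML::TableExtract modules from CPAN. The naive join(',') output does no quoting or escaping, so a real tool would want Text::CSV for the CSV step:

#!/usr/bin/env perl
# Sketch only: fetch a page, summarize its tables, or dump table n as CSV.
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TableExtract;

my ($url, $n) = @ARGV;
die "usage: tablescrape URL [n]\n" unless defined $url;

my $html = get($url) or die "could not fetch $url\n";
my $te   = HTML::TableExtract->new;
$te->parse($html);
my @tables = $te->tables;
die "no tables found\n" unless @tables;

if (defined $n or @tables == 1) {
    # Dump the chosen table (1-based index) as naive, unquoted CSV.
    my $table = $tables[ defined $n ? $n - 1 : 0 ]
        or die "no such table: $n\n";
    for my $row ($table->rows) {
        print join(',', map { defined $_ ? $_ : '' } @$row), "\n";
    }
}
else {
    # Summarize each table: index, header row, total number of rows.
    my $i = 0;
    for my $table (@tables) {
        my @rows = $table->rows;
        printf "%d. [%s] (%d rows)\n", ++$i,
            join(', ', map { defined $_ ? $_ : '' } @{ $rows[0] || [] }),
            scalar @rows;
    }
}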
Potential additional features:
- To be really fancy you could parse a table within a table, but for my purposes (fetching data from Wikipedia pages and the like) that's overkill. The Perl module HTML::TableExtract can do this and may be a good place to start for writing the tool I have in mind.
- An option to asciify any Unicode.
- An option to apply an arbitrary regex substitution for fixing weirdnesses in the parsed table. A sketch of these last two options follows this list.
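One plausible approach to those two options (my assumption, not part of the spec above) is a per-cell post-processing pass. The CPAN module Text::Unidecode covers the asciify case, and the regex fix can ride along as a substitution applied to each cell; the clean_cell helper name is hypothetical:

# Hypothetical per-cell cleanup pass for the asciify and regex options.
use Text::Unidecode qw(unidecode);

sub clean_cell {
    my ($cell, %opt) = @_;
    $cell = unidecode($cell) if $opt{asciify};    # e.g. "naïve" -> "naive"
    $cell =~ s/$opt{match}/$opt{replace}/g if defined $opt{match};
    return $cell;
}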