How to best develop web crawlers
Posted by Fernando Barrocal on Stack Overflow
Published on 2009-02-07T02:15:36Z
Indexed on 2010/04/05 3:53 UTC
crawler | web-crawler
Hey all,
I often write crawlers to compile information. Whenever I come across a website whose data I need, I start a new crawler specific to that site, using shell scripts most of the time and sometimes PHP.
The way I do it is with a simple for loop to iterate over the page list, wget to download each page, and sed, tr, awk, or other utilities to clean the page and grab the specific info I need.
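For readers unfamiliar with this style of crawler, the loop described above might look roughly like the sketch below. It is only an illustration, not my actual script: the function names extract_titles and crawl, the choice of the <title> tag as the "specific info", and the one-second pause are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical extraction step: print the text inside <title>...</title>
# for every input line that contains it, then squeeze repeated spaces.
# A real crawler would use a site-specific pattern here.
extract_titles() {
    sed -n 's:.*<title>\(.*\)</title>.*:\1:p' | tr -s ' '
}

# Hypothetical crawl loop: download each URL given as an argument
# and pipe the page through the extraction step.
crawl() {
    for url in "$@"; do
        # -q silences progress output, -O - writes the page to stdout
        wget -q -O - "$url" | extract_titles
        # Pause between requests to avoid hammering the site (assumed delay)
        sleep 1
    done
}
```

It would be called with the page list as arguments, for example: crawl http://example.com/page1 http://example.com/page2 (placeholder URLs). Note that a line-oriented sed pattern like this only works when the opening and closing tags sit on the same line, which is one reason this approach gets fragile.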
The whole process takes some time depending on the site, and even more to download all the pages. And I often run into an AJAX site, which complicates everything.
I was wondering if there are better or faster ways to do this, or even applications or languages that help with this kind of work.
© Stack Overflow or respective owner