How to best develop web crawlers
- by Fernando Barrocal
Hey all,
I'm used to writing crawlers to compile information, and whenever I come across a website with info I need, I start a new crawler specific to that site, using shell scripts most of the time and sometimes PHP.
The way I do it is with a simple for loop to iterate over the page list, wget to download each page, and sed, tr, awk, or other utilities to clean the page and grab the specific info I need.
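To make that concrete, here is a minimal sketch of the kind of script I mean; the URL, page range, and sed pattern are made up for illustration:

    #!/bin/bash
    # Sketch of my usual approach: loop over numbered pages,
    # download each one, and scrape a single field out of the HTML.
    for i in $(seq 1 50); do
        # wget fetches each page in the list (URL is hypothetical)
        wget -q -O "page_$i.html" "http://example.com/items?page=$i"
        # sed strips the markup and prints just the field I want
        # (the <span class="title"> pattern is hypothetical too)
        sed -n 's/.*<span class="title">\([^<]*\)<\/span>.*/\1/p' \
            "page_$i.html" >> titles.txt
    done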
The whole process takes some time, depending on the site, and even longer to download all the pages. And I often run into an AJAX site, which complicates everything.
I was wondering if there are better or faster ways to do this, or even applications or languages that would help with this kind of work.