Not crawling the same content twice
- by sirrocco
I'm building a small application that will crawl sites whose content keeps growing (like Stack Overflow). The difference is that, once created, the content is rarely modified.
On the first pass I crawl all the pages of the site.
On later passes I don't want to re-crawl all of the site's paged content, just the latest additions.
So if the site had 500 pages on the first pass and has 501 pages on the second pass, I would only crawl the first and second pages (assuming the newest content appears on page 1). Would this be a good way to handle the situation?
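A minimal sketch of that page-count idea, in Python. It assumes the listing is ordered newest first with a fixed page size, so new items push older ones toward later pages. The function name `pages_to_recrawl` is made up for illustration. One extra page is fetched beyond the growth in page count, because the new items may straddle a page boundary; that is why 500 -> 501 pages means crawling pages 1 and 2.

```python
def pages_to_recrawl(old_page_count: int, new_page_count: int) -> range:
    """Return the page numbers worth re-fetching (hypothetical helper).

    Assumes newest-first ordering and a constant page size.
    """
    if new_page_count < old_page_count:
        # The site shrank or changed its page size: fall back to a full crawl.
        return range(1, new_page_count + 1)
    grown = new_page_count - old_page_count
    # One extra page, since new items can spill across a page boundary.
    last = min(grown + 1, new_page_count)
    return range(1, last + 1)


print(list(pages_to_recrawl(500, 501)))  # [1, 2]
```

Note that even when the page count is unchanged, page 1 is still fetched, since a partially filled last page can absorb new items without adding a page.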
In the end, the crawled content will go into Lucene to create a custom search engine.
So I would like to avoid crawling the same content multiple times. Any better ideas?
EDIT:
Let's say the site has a Results page that is accessed like so:
Results?page=1, Results?page=2, etc.
I guess that keeping track of how many pages there were at the last crawl and crawling only the difference would be enough. (Maybe also using a hash of each result on the page: if I start running into hashes I have already seen, I should stop.)
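The hash-based stop could be sketched roughly like this, again in Python. Everything here is an assumption for illustration: `fetch_page` stands in for whatever downloads and parses `Results?page=N` into a list of result texts, `seen` is the set of hashes stored from earlier crawls, and results are assumed to be listed newest first.

```python
import hashlib
from typing import Callable, List, Set


def result_hash(text: str) -> str:
    """Stable fingerprint of one result's text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def crawl_new_results(
    fetch_page: Callable[[int], List[str]],  # hypothetical: page number -> result texts
    seen: Set[str],                          # hashes stored from previous crawls
    max_pages: int = 1000,                   # safety limit, arbitrary
) -> List[str]:
    """Walk pages from 1 upward and stop at the first already-seen result."""
    new_results: List[str] = []
    fresh_hashes: Set[str] = set()
    done = False
    for page in range(1, max_pages + 1):
        results = fetch_page(page)
        if not results:
            break  # ran past the last page
        for text in results:
            h = result_hash(text)
            if h in seen:
                done = True  # reached content from an earlier crawl
                break
            fresh_hashes.add(h)
            new_results.append(text)
        if done:
            break
    # Record the new hashes only after the walk, so two identical
    # new results do not stop the crawl early.
    seen.update(fresh_hashes)
    return new_results
```

This only works if old results never reappear above new ones; if the site re-sorts or edits results, stopping after several consecutive known hashes rather than the first one would be a safer variant.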