Scraping paginated items from a website using scrapy
Posted by Mridang Agarwalla on Stack Overflow
Published on 2012-10-16T16:58:04Z
Indexed on 2012/10/16 17:00 UTC
I'm using Scrapy to scrape items from a site, and I can't work out how to implement the following scraping pattern. The site is a forum, and I scrape it once a day.
Each page has a table containing posts. New posts are added to the top of the table, and as more posts are added, older posts are pushed onto later pages by the pagination. This is a very simple scenario, and we will assume that the order of the posts never changes.
I would like to scrape all the "new" records on each run, stopping when the last post scraped yesterday is encountered. I have configured my spider to paginate endlessly; it should stop as soon as it encounters yesterday's last scraped post.
How can I implement this?
(My Scrapy installation works with my Django installation using django-dynamic-scraper.)
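The stop condition described above can be sketched in plain Python, independent of Scrapy, to show the control flow. This is only a minimal sketch under some assumptions: every post is taken to have a stable unique `id`, the marker stored from the previous run is taken to be the id of the newest post seen then, and the names `new_posts_until`, `fetch_pages`, `SITE` and `fetched` are hypothetical stand-ins for the real page requests.

```python
from typing import Dict, Iterable, Iterator, List, Optional

Post = Dict[str, str]


def new_posts_until(pages: Iterable[List[Post]],
                    last_seen_id: Optional[str]) -> Iterator[Post]:
    """Yield posts newest-first, stopping at the post stored by the previous run.

    Pages are consumed lazily, so no page after the one containing
    `last_seen_id` is requested. With `last_seen_id=None` (first run),
    every post is yielded.
    """
    for page in pages:
        for post in page:
            if post["id"] == last_seen_id:
                return  # everything from here on was scraped before
            yield post


# Hypothetical stand-in for the forum: three pages, newest posts first.
SITE: List[List[Post]] = [
    [{"id": "107"}, {"id": "106"}],
    [{"id": "105"}, {"id": "104"}],
    [{"id": "103"}, {"id": "102"}],
]

fetched: List[int] = []  # records which pages were actually "downloaded"


def fetch_pages(site: List[List[Post]]) -> Iterator[List[Post]]:
    """Simulate downloading one page at a time."""
    for number, page in enumerate(site, start=1):
        fetched.append(number)
        yield page


# Yesterday's run ended with "104" as the newest stored post.
new_ids = [post["id"] for post in new_posts_until(fetch_pages(SITE), "104")]

# The newest id from this run becomes the marker for tomorrow's run.
next_marker = new_ids[0] if new_ids else "104"
```

In a Scrapy spider the same check would probably sit in the parse callback: yield items row by row until the marker is found, and only yield the request for the next page if the marker was not found on the current one. Raising `scrapy.exceptions.CloseSpider` from the callback is another way to stop the crawl. How this would hook into django-dynamic-scraper is not covered here, and nothing is assumed about its API.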
© Stack Overflow or respective owner