Scraping paginated items from a website using scrapy
- by Mridang Agarwalla
I'm using scrapy to scrape items from a site. I'm not being able to implement this scraping pattern. The site I'm trying to scrape is a forum and I scrape the site once a day.
Each page has a table containing posts. New posts are added to the top of the table and as more and more posts are posted to the site, the older posts go further into the pages due to pagination. This is a very simple scenario and we will assume that the order of the posts never change.
I would like to scrape this site and scrape all the "new" records until the last scraped post from yesterday is encountered. I have configured my spider to paginate endlessly and when it encounters yesterday's last scraped post, it should stop.
How can implement this?
(My Scrapy installation works with my Django installation using django-dynamic-scraper )