Scraping HTML without unique identifiers using Python
- by Nicholas Law
I would like to design an algorithm in Python that scrapes thousands of pages like this one and this one, gathers all the data, and inserts it into a MySQL database. The script will run on a weekly or bi-weekly basis to update the database with any new information added to each individual page.
Ideally I would like a scraper that is easy to work with for table-structured data, but also for data whose elements have no unique identifiers (i.e. no id or class attributes).
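To make the update step concrete, here is a minimal sketch of the "insert or refresh" behaviour I have in mind. It is only an illustration under assumptions: Python's built-in sqlite3 module stands in for MySQL (where the comparable statement is INSERT ... ON DUPLICATE KEY UPDATE), the upsert syntax needs SQLite 3.24 or newer, and the `pages` table, its columns, and the `upsert_page` helper are hypothetical names.

```python
import sqlite3


def upsert_page(conn, url, content):
    """Insert a scraped page, or overwrite its content if the URL is already stored."""
    conn.execute(
        """
        INSERT INTO pages (url, content) VALUES (?, ?)
        ON CONFLICT(url) DO UPDATE SET content = excluded.content
        """,
        (url, content),
    )
    conn.commit()


# In-memory database so the sketch runs without any setup (SQLite, not MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, content TEXT)")

# Two runs over the same (hypothetical) URL: the second replaces the first.
upsert_page(conn, "http://example.com/a", "first version")
upsert_page(conn, "http://example.com/a", "second version")

rows = conn.execute("SELECT url, content FROM pages").fetchall()
```

Keying on the URL means a weekly re-run would refresh existing rows rather than duplicate them.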
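As a rough sketch of what "no unique identifiers" means in practice, the example below picks a table by the text of its header row instead of an id or class. It uses only the standard library's html.parser so that it runs as-is; the sample HTML, the `TableExtractor` class, and the `find_table` helper are made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table encountered
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def find_table(html, required_headers):
    """Return the first table whose first row contains all required header texts."""
    parser = TableExtractor()
    parser.feed(html)
    for table in parser.tables:
        if table and set(required_headers) <= set(table[0]):
            return table
    return None


# Hypothetical page: two tables, neither with an id or class attribute.
SAMPLE = """
<table><tr><td>Menu</td></tr></table>
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>19.50</td></tr>
</table>
"""

table = find_table(SAMPLE, ["Name", "Price"])
header, *rows = table
records = [dict(zip(header, row)) for row in rows]
```

Matching on header text tends to survive layout changes better than matching on position (for example "the second table on the page"), although it still breaks if the headers are renamed.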
Which scraping library should I use: BeautifulSoup, Scrapy, or Mechanize?
Are there any particular tutorials or books I should look at to achieve this?
In the long run I will be implementing a mobile app that works with all this data by querying the database.