Scraping HTML without unique identifiers using Python
Posted by Nicholas Law on Stack Overflow
Published on 2013-10-22T21:43:36Z
Indexed on 2013/10/22 21:53 UTC
I would like to design an algorithm in Python that scrapes thousands of pages like this one and this one, gathers all the data, and inserts it into a MySQL database. The script will run on a weekly or bi-weekly basis to update the database with any new information added to each individual page.
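As a rough sketch of the update step I have in mind: each run would overwrite the stored row for a page rather than add a duplicate. The example below uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server; the `pages` table, its columns, and the helper names `open_store` and `save_page` are hypothetical. In MySQL itself the comparable statement would be `INSERT ... ON DUPLICATE KEY UPDATE`, issued through whichever driver is chosen.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and make sure the (hypothetical) pages table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"   # one row per scraped page
        " title TEXT,"
        " body TEXT)"
    )
    return conn


def save_page(conn, url, title, body):
    """Insert a page, or overwrite the stored row if the url is already known."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, title, body) VALUES (?, ?, ?)",
        (url, title, body),
    )
    conn.commit()


# Two runs over the same page: the second replaces the first.
conn = open_store()
save_page(conn, "http://example.com/a", "First title", "old text")
save_page(conn, "http://example.com/a", "First title", "new text")
stored = conn.execute("SELECT url, body FROM pages").fetchall()
```

Keying the table on the page URL is only one possible choice; pages that hold several records each would need a finer-grained key.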
Ideally I would like a scraper that is easy to use with table-structured data and also with data that has no unique identifiers (i.e. no id or class attributes).
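To illustrate the kind of extraction meant here, below is a minimal sketch that reads table cells purely by their position in the markup, without relying on any id or class attribute. It uses only the standard-library `html.parser` module; the `TableRowParser` class, the `extract_rows` helper, and the sample HTML are made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return the table rows found in html as lists of cell text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample markup with no id or class attributes.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td>4.50</td></tr>
  <tr><td>Gadget</td><td>12.00</td></tr>
</table>
"""

# Treat the first row as column names and pair them with each later row.
header, *records = extract_rows(SAMPLE_HTML)
items = [dict(zip(header, record)) for record in records]
```

The same positional idea (first row as header, cells matched by column order) carries over to whichever scraping library is used.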
Which scraping library should I use: BeautifulSoup, Scrapy, or Mechanize?
Are there any particular tutorials/books I should be looking at for this desired result?
In the long run I will be implementing a mobile app that works with all this data by querying the database.
© Stack Overflow or respective owner