Storing URLs while Spidering
- by itemio
I created a little web spider in Python that I'm using to collect URLs. I'm not interested in the content itself. Right now I'm keeping all the visited URLs in a set in memory, because I don't want my spider to visit any URL twice. Of course, that's a very limited way of accomplishing this.
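For context, this is roughly what I have now, as a simplified sketch (`fetch_links` is just a stand-in for my real fetch-and-parse code):

```python
from collections import deque

def crawl(seed_urls, fetch_links):
    # fetch_links(url) -> iterable of URLs found on that page
    # (placeholder for the actual downloading/parsing).
    visited = set()           # every URL ever seen, held in memory
    queue = deque(seed_urls)  # URLs still waiting to be visited
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                queue.append(link)
    return visited
```

This works fine for small crawls, but the set grows without bound and is lost as soon as the process exits.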
So what's the best way to keep track of my visited URLs?
Should I use a database?
* which one? MySQL, SQLite, PostgreSQL?
* how should I save the URLs? As a primary key, trying to insert every URL before visiting it and skipping it if the insert fails? (see the sketch after this list)
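If the primary-key approach is the right idea, I imagine it looking something like this untested sketch using SQLite (the file name `urls.db` and table name `visited` are just placeholders):

```python
import sqlite3

conn = sqlite3.connect("urls.db")
conn.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")

def mark_visited(url):
    # Try to insert the URL; the PRIMARY KEY constraint rejects
    # duplicates, so a failed insert means we've already seen it.
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO visited (url) VALUES (?)", (url,))
        return True   # new URL -> go ahead and visit it
    except sqlite3.IntegrityError:
        return False  # already visited -> skip
```

The spider would then only fetch a URL when `mark_visited(url)` returns True. Is that the usual pattern, or is there a better way?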
Or should I write them to a file?
* one file? (see the sketch after this list)
* multiple files? How should I design the file structure?
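For the one-file option, the simplest thing I can think of is an append-only log that gets reloaded into a set on startup (again just a sketch; `visited_urls.txt` is a placeholder name):

```python
import os

VISITED_FILE = "visited_urls.txt"

def load_visited():
    # Rebuild the in-memory set from the log when the spider starts.
    if not os.path.exists(VISITED_FILE):
        return set()
    with open(VISITED_FILE) as f:
        return {line.strip() for line in f if line.strip()}

def record_visit(url, visited):
    # One URL per line, appended as we go.
    visited.add(url)
    with open(VISITED_FILE, "a") as f:
        f.write(url + "\n")
```

That persists across runs, but everything still ends up in memory once it's loaded, which is why I'm wondering whether multiple files would help and how to split them.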
I'm sure there are books and a lot of papers on this or similar topics. Can you give me some advice on what I should read?