Storing URLs while Spidering

I created a little web spider in Python which I'm using to collect URLs. I'm not interested in the content. Right now I'm keeping all visited URLs in a set in memory, because I don't want my spider to visit any URL twice. Of course, that approach is very limited: the set can't grow beyond what fits in memory.
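
For context, the heart of what I have now looks roughly like this (a minimal sketch; `fetch_links` is a stand-in for whatever function extracts outgoing URLs from a page):

```python
from collections import deque

def crawl(start_url, fetch_links):
    """Breadth-first crawl that skips already-visited URLs."""
    visited = set()            # lives entirely in memory -- the problem
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                queue.append(link)
    return visited
```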

So what's the best way to keep track of my visited URLs?

Should I use a database?
* Which one? MySQL, SQLite, PostgreSQL?
* How should I save the URLs? As a primary key, trying to insert every URL before visiting it? (A sketch of what I mean follows below.)
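
To make the primary-key idea concrete, here is roughly what I imagine with the standard-library `sqlite3` module (only a sketch; the file name `visited.db` is arbitrary):

```python
import sqlite3

conn = sqlite3.connect("visited.db")
conn.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")

def try_visit(url):
    """Record `url` and report whether it was new.

    INSERT OR IGNORE leaves the table untouched when the primary key
    already exists, so rowcount is 1 for a new URL and 0 otherwise.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO visited (url) VALUES (?)", (url,)
    )
    conn.commit()
    return cur.rowcount == 1
```

The spider would then call `try_visit(url)` before fetching and skip the URL whenever it returns False.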

Or should I write them to a file?
* One file?
* Multiple files? How should I design the file structure? (Again, a sketch of the single-file idea follows.)
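
The single-file variant I have in mind is an append-only log, something like this (again just a sketch; `visited_urls.txt` is a made-up name, and the whole set still has to be rebuilt in memory on startup):

```python
import os

VISITED_FILE = "visited_urls.txt"

def load_visited():
    """Rebuild the visited set from the log, one URL per line."""
    if not os.path.exists(VISITED_FILE):
        return set()
    with open(VISITED_FILE) as f:
        return {line.strip() for line in f}

def mark_visited(url, visited, log):
    """Record a URL in the in-memory set and the append-only log."""
    visited.add(url)
    log.write(url + "\n")
    log.flush()  # so a crash doesn't lose recently visited URLs
```

Here the log would be opened once with `open(VISITED_FILE, "a")` and kept open for the lifetime of the spider.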

I'm sure there are books and plenty of papers on this or similar topics. Can you give me some advice on what I should read?
