Storing URLs while Spidering
Posted by itemio on Stack Overflow, 2010-04-11
I created a little web spider in Python which I'm using to collect URLs. I'm not interested in the content. Right now I'm keeping all the visited URLs in a set in memory, because I don't want my spider to visit the same URL twice. Of course, keeping everything in an in-memory set is a very limited approach.
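Roughly, the relevant part looks like this (simplified; `extract_links` is just a placeholder for my real fetch-and-parse code):

```python
visited = set()   # everything lives in memory

def extract_links(url):
    # placeholder for my real fetch-and-parse code
    return []

def crawl(start_url):
    queue = [start_url]
    while queue:
        url = queue.pop()
        if url in visited:
            continue          # already seen, don't visit twice
        visited.add(url)
        queue.extend(link for link in extract_links(url)
                     if link not in visited)
```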
So what's the best way to keep track of my visited URLs?
Should I use a database?

* Which one? MySQL, SQLite, Postgres?
* How should I save the URLs? With the URL as a primary key, trying to insert every URL before visiting it? (Something like the sketch below.)
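With SQLite, for example, I imagine something like the following, where a failed insert on the primary key tells me the URL was already visited (just a sketch of the idea, not tested code):

```python
import sqlite3

conn = sqlite3.connect("visited.db")
conn.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")

def mark_visited(url):
    """Try to insert the URL; True if it was new, False if seen before."""
    try:
        with conn:   # commits on success, rolls back on error
            conn.execute("INSERT INTO visited (url) VALUES (?)", (url,))
        return True
    except sqlite3.IntegrityError:   # primary-key violation: already stored
        return False
```

What appeals to me here is that the uniqueness check and the insert happen in one step, so there is no separate check-then-add.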
Or should I write them to a file?

* One file?
* Multiple files? How should I design the file structure? (See the sketch after this list.)
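For the single-file variant, the simplest scheme I can think of is one URL per line, appended as I go and reloaded into a set on startup (again only a sketch):

```python
VISITED_FILE = "visited.txt"

def load_visited():
    """Rebuild the in-memory set from the file, one URL per line."""
    try:
        with open(VISITED_FILE) as f:
            return set(line.strip() for line in f)
    except FileNotFoundError:
        return set()

visited = load_visited()

def mark_visited(url):
    if url in visited:
        return False
    visited.add(url)
    with open(VISITED_FILE, "a") as f:   # append-only, never rewrite the file
        f.write(url + "\n")
    return True
```

But that still keeps the whole set in memory; the file only gives me persistence across restarts, which is why I wonder whether multiple files or some smarter structure would be better.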
I'm sure there are books and plenty of papers on this or similar topics. Can you give me some advice on what I should read?