Asynchronous crawling in F#
- by jlezard
When crawling web pages I need to be careful not to make too many requests to the same domain; for example, I want to leave 1 s between requests. From what I understand, it is the time between requests that matters. So to speed things up I want to use async workflows in F#: make the requests at 1 s intervals, but avoid blocking while waiting for each response.
    open System
    open System.IO
    open System.Net
    open System.Threading

    // Waits `timer` milliseconds, then downloads the page at `uri`.
    let getHtmlPrimitiveAsyncTimer (uri : System.Uri) (timer : int) =
        async {
            let req = WebRequest.Create(uri) :?> HttpWebRequest
            req.UserAgent <- "Mozilla"
            try
                // Delay before sending the request
                Thread.Sleep(timer)
                let! resp = req.AsyncGetResponse()
                Console.WriteLine(uri.AbsoluteUri + " got response")
                use stream = resp.GetResponseStream()
                use reader = new StreamReader(stream)
                let html = reader.ReadToEnd()
                return html
            with
            | _ -> return "Bad Link"
        }
Then I do something like:
    let uri1 = System.Uri "http://rue89.com"
    let timer = 1000
    let jobs = [| for i in 1..10 -> getHtmlPrimitiveAsyncTimer uri1 timer |]

    jobs
    |> Array.mapi (fun i job ->
        Console.WriteLine("Starting job " + string i)
        Async.StartAsTask(job).Result)
Is this alright? I am very unsure about two things:
- Does the Thread.Sleep call work for delaying the request?
- Is using Async.StartAsTask (and reading .Result) a problem?
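For comparison, here is a minimal offline sketch of the behavior described at the top: starts spaced a fixed interval apart, with earlier responses still pending. It is only an illustration under assumptions. `fakeFetch` and `staggered` are hypothetical names, the network call is replaced by a simulated latency, and it swaps in `Async.Sleep` for `Thread.Sleep` and `Async.Parallel` for `Async.StartAsTask(job).Result`.

```fsharp
open System
open System.Diagnostics

// Hypothetical stand-in for a real HTTP request: waits `latencyMs`
// without holding a thread, then returns a dummy body.
let fakeFetch (latencyMs : int) (i : int) =
    async {
        do! Async.Sleep latencyMs
        return sprintf "response %d" i
    }

// Job i waits i * intervalMs (via Async.Sleep) before "sending", so the
// starts are spaced out while earlier responses are still in flight.
let staggered (intervalMs : int) (latencyMs : int) (count : int) =
    let clock = Stopwatch.StartNew()
    List.init count (fun i ->
        async {
            do! Async.Sleep (i * intervalMs)
            let startedAt = clock.ElapsedMilliseconds
            let! body = fakeFetch latencyMs i
            return (i, startedAt, body)
        })
    |> Async.Parallel
    |> Async.RunSynchronously

// Assumed numbers for the demo: 100 ms spacing, 250 ms latency, 4 jobs.
let results = staggered 100 250 4
for (i, startedAt, body) in results do
    printfn "job %d started at ~%d ms: %s" i startedAt body
```

With these assumed numbers the start times should come out roughly 100 ms apart, and the whole run should take noticeably less than running the four jobs one after another, since the simulated responses overlap.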
I am a beginner (as you may have noticed) in F# (and in coding in general, actually), and everything involving threads scares me :)
Thanks!