Asynchronous crawling in F#

Posted by jlezard on Stack Overflow
Published on 2010-06-11T10:08:34Z Indexed on 2010/06/11 10:12 UTC

When crawling web pages, I need to be careful not to make too many requests to the same domain; for example, I want to leave 1 s between requests. From what I understand, it is the time between requests that matters. To speed things up, I want to use async workflows in F#: the idea is to issue requests at 1-second intervals without blocking while waiting for each response.

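open System
open System.IO
open System.Net
open System.Threading

let getHtmlPrimitiveAsyncTimer (uri : Uri) (timer : int) =
    async {
        let req = WebRequest.Create(uri) :?> HttpWebRequest
        req.UserAgent <- "Mozilla"
        try
            // Delay before sending the request (the call this question asks about).
            Thread.Sleep(timer)
            // Wait for the response without blocking the thread.
            let! resp = req.AsyncGetResponse()
            Console.WriteLine(uri.AbsoluteUri + " got response")
            use stream = resp.GetResponseStream()
            use reader = new StreamReader(stream)
            let html = reader.ReadToEnd()
            return html
        with
        // Any failure (bad URI, timeout, HTTP error) is reported as a bad link.
        | _ -> return "Bad Link"
    }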

Then I do something like:

let uri1 = System.Uri "http://rue89.com"
let timer = 1000
let jobs = [| for i in 1..10 -> getHtmlPrimitiveAsyncTimer uri1 timer |]

jobs
|> Array.mapi (fun i job ->
    Console.WriteLine("Starting job " + string i)
    Async.StartAsTask(job).Result)

Is this all right? I am very unsure about two things:
- Does Thread.Sleep work for delaying the request?
- Is using Async.StartAsTask a problem?
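
For comparison, here is a minimal sketch of the non-blocking version of the idea described above. It is one possible shape, not a definitive answer. It relies on two facts: Thread.Sleep blocks the thread it runs on, and reading .Result on each task waits for that job to finish before the next one starts. Async.Sleep, by contrast, waits on a timer without holding a thread. The helper name staggered and the stand-in fakeFetch are hypothetical and exist only to keep the sketch self-contained; in real use each job would be the download workflow.

```fsharp
open System

// Hypothetical helper: start job i after i * intervalMs milliseconds.
// Async.Sleep waits on a timer and does not block a thread meanwhile.
let staggered (intervalMs : int) (jobs : Async<'T> list) : Async<'T []> =
    jobs
    |> List.mapi (fun i job ->
        async {
            do! Async.Sleep(i * intervalMs)
            return! job
        })
    |> Async.Parallel   // results come back in the same order as the input

// Stand-in for a real download (assumption: replaces the HTTP request).
let fakeFetch (name : string) : Async<string> =
    async {
        do! Async.Sleep 50
        return name + " done"
    }

let results =
    [ "a"; "b"; "c" ]
    |> List.map fakeFetch
    |> staggered 100
    |> Async.RunSynchronously
```

Note that this spaces out the start of each request. If the gap has to be measured from the end of one response to the start of the next request, the jobs would need to run one after another instead.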

I am a beginner (as you may have noticed) in F# (and in coding in general, actually), and everything involving threads scares me :)

Thanks !!

© Stack Overflow or respective owner
