ASP.NET crawler: WebResponse operation timed out

Posted by Leon on Stack Overflow
Published on 2010-05-18T05:46:26Z Indexed on 2010/05/18 5:50 UTC

Hi, I have built a simple thread-pool-based web crawler within my web application. Its job is to crawl the application's own URL space and build a Lucene index of every valid web page and its meta content.

Here's the problem. When I run the crawler from a debug server instance of Visual Studio Express and give it the IIS URL as the starting point, it works fine. However, when I do not provide the IIS URL and the crawler uses its own URL to start the crawl (i.e., it crawls its own domain space), the call to HttpWebRequest.GetResponse() throws an "operation timed out" exception.

Could someone please guide me on what I should or should not be doing here? Here is my code for fetching a page. It is executed in a multithreaded environment.

// Requires: using System.IO; using System.Net;
private static string GetWebText(string url)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.UserAgent = "My Crawler";

    // Dispose the response, stream, and reader even if reading throws.
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        return reader.ReadToEnd();
    }
}

And the following is my stack trace:

at System.Net.HttpWebRequest.GetResponse()
at CSharpCrawler.Crawler.GetWebText(String url) in c:\myAppDev\myApp\site\App_Code\CrawlerLibs\Crawler.cs:line 366
at CSharpCrawler.Crawler.CrawlPage(String url, List`1 threadCityList) in c:\myAppDev\myApp\site\App_Code\CrawlerLibs\Crawler.cs:line 105
at CSharpCrawler.Crawler.CrawlSiteBuildIndex(String hostUrl, String urlToBeginSearchFrom, List`1 threadCityList) in c:\myAppDev\myApp\site\App_Code\CrawlerLibs\Crawler.cs:line 89
at crawler_Default.threadedCrawlSiteBuildIndex(Object threadedCrawlerObj) in c:\myAppDev\myApp\site\crawler\Default.aspx.cs:line 108
at System.Threading.QueueUserWorkItemCallback.WaitCallback_Context(Object state)
at System.Threading.ExecutionContext.runTryCode(Object userData)
at System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode code, CleanupCode backoutCode, Object userData)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx)
at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()
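
To narrow down where the timeout occurs, a diagnostic variant of the fetch method might look like the sketch below. It is not the original crawler code and does not claim to fix the problem: the class name FetchDiagnostics, the method name GetWebTextWithTimeout, the 30-second limit, the fallback URL, and the console logging are all assumptions made for illustration. It sets explicit timeouts and reports WebException.Status, which shows whether the failure is really a timeout or something else, such as a connection failure.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper for diagnosis; not part of the original crawler.
public static class FetchDiagnostics
{
    public static string GetWebTextWithTimeout(string url, int timeoutMs)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "My Crawler";
        request.Timeout = timeoutMs;          // limit for GetResponse()
        request.ReadWriteTimeout = timeoutMs; // limit for reads on the response stream

        try
        {
            using (WebResponse response = request.GetResponse())
            using (Stream stream = response.GetResponseStream())
            using (StreamReader reader = new StreamReader(stream))
            {
                return reader.ReadToEnd();
            }
        }
        catch (WebException ex)
        {
            // ex.Status separates a timeout from other failures
            // (for example, a connection or name-resolution failure).
            Console.Error.WriteLine("Fetch failed for {0}: {1} ({2})",
                url, ex.Status, ex.Message);
            return null;
        }
    }

    public static void Main(string[] args)
    {
        // The fallback URL is an assumption; pass the page to test as the first argument.
        string url = args.Length > 0 ? args[0] : "http://localhost/";
        string html = GetWebTextWithTimeout(url, 30000);
        Console.WriteLine(html == null
            ? "No content fetched."
            : "Fetched " + html.Length + " characters.");
    }
}
```

Running this against both the IIS URL and the application's own URL would show whether the two cases fail with the same status.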

Thanks and cheers, Leon.

© Stack Overflow or respective owner
