Search Results

Search found 24 results on 1 page for 'webcrawling'.

  • Asynchronous Webcrawling F#, something wrong ?

    - by jlezard
    Not quite sure if it is OK to do this, but my question is: is there something wrong with my code? It doesn't go as fast as I would like, and since I am using lots of async workflows, maybe I am doing something wrong. The goal here is to build something that can crawl 20,000 pages in less than an hour.

        open System
        open System.Text
        open System.Net
        open System.IO
        open System.Text.RegularExpressions
        open System.Collections.Generic
        open System.ComponentModel
        open Microsoft.FSharp
        open System.Threading

        //This is the Parallel.Fs file

        type ComparableUri ( uri: string ) =
            inherit System.Uri( uri )
            let elts (uri:System.Uri) = uri.Scheme, uri.Host, uri.Port, uri.Segments
            interface System.IComparable with
                member this.CompareTo( uri2 ) = compare (elts this) (elts(uri2 :?> ComparableUri))
            override this.Equals(uri2) = compare this (uri2 :?> ComparableUri ) = 0
            override this.GetHashCode() = 0

        ///////////////////// Functions to retrieve html string /////////////////////
        let mutable error = Set.empty<ComparableUri>
        let mutable visited = Set.empty<ComparableUri>

        let getHtmlPrimitiveAsyncDelay (delay:int) (uri : ComparableUri) =
            async{
                try
                    let req = (WebRequest.Create(uri)) :?> HttpWebRequest
                    // 'use' is equivalent to ‘using’ in C# for an IDisposable
                    req.UserAgent<-"Mozilla"
                    //Console.WriteLine("Waiting")
                    do! Async.Sleep(delay * 250)
                    let! resp = (req.AsyncGetResponse())
                    Console.WriteLine(uri.AbsoluteUri+" got response after delay "+string delay)
                    use stream = resp.GetResponseStream()
                    use reader = new StreamReader(stream)
                    let html = reader.ReadToEnd()
                    return html
                with
                | _ as ex ->
                    Console.WriteLine( ex.ToString() )
                    lock error (fun () -> error<- error.Add uri )
                    lock visited (fun () -> visited<-visited.Add uri )
                    return "BadUri"
            }

        ///////////////////// Active Pattern Matching to retrieve href /////////////////////
        let (|Matches|_|) (pat:string) (inp:string) =
            let m = Regex.Matches(inp, pat)
            // Note the List.tl, since the first group is always the entirety of the matched string.
            if m.Count > 0
            then Some (List.tail [ for g in m -> g.Value ])
            else None

        let (|Match|_|) (pat:string) (inp:string) =
            let m = Regex.Match(inp, pat)
            // Note the List.tl, since the first group is always the entirety of the matched string.
            if m.Success
            then Some (List.tail [ for g in m.Groups -> g.Value ])
            else None

        ///////////////////// Find Bad href /////////////////////
        let isEmail (link:string) =
            link.Contains("@")
        let isMailto (link:string) =
            if Seq.length link >=6 then link.[0..5] = "mailto" else false
        let isJavascript (link:string) =
            if Seq.length link >=10 then link.[0..9] = "javascript" else false
        let isBadUri (link:string) =
            link="BadUri"
        let isEmptyHttp (link:string) =
            link="http://"
        let isFile (link:string)=
            if Seq.length link >=6 then link.[0..5] = "file:/" else false
        let containsPipe (link:string) =
            link.Contains("|")
        let isAdLink (link:string) =
            if Seq.length link >=6 then link.[0..5] = "adlink"
            elif Seq.length link >=9 then link.[0..8] = "http://adLink"
            else false

        ///////////////////// Find Bad href /////////////////////
        let getHref (htmlString:string) =
            let urlPat = "href=\"([^\"]+)"
            match htmlString with
            | Matches urlPat urls ->
                urls |> List.map( fun href ->
                    match href with
                    | Match (urlPat) (link::[]) -> link
                    | _ -> failwith "The href was not in correct format, there was more than one match" )
            | _ -> Console.WriteLine( "No links for this page" );[]
            |> List.filter( fun link -> not(isEmail link) )
            |> List.filter( fun link -> not(isMailto link) )
            |> List.filter( fun link -> not(isJavascript link) )
            |> List.filter( fun link -> not(isBadUri link) )
            |> List.filter( fun link -> not(isEmptyHttp link) )
            |> List.filter( fun link -> not(isFile link) )
            |> List.filter( fun link -> not(containsPipe link) )
            |> List.filter( fun link -> not(isAdLink link) )

        let treatAjax (href:System.Uri) =
            let link = href.ToString()
            let firstPart = (link.Split([|"#"|],System.StringSplitOptions.None)).[0]
            new Uri(firstPart)

        //only follow pages with certain extensions or ones with no extension
        let followHref (href:System.Uri) =
            let valid2 = set[".py"]
            let valid3 = set[".php";".htm";".asp"]
            let valid4 = set[".php3";".php4";".php5";".html";".aspx"]
            let arrLength = href.Segments |> Array.length
            let lastExtension = (href.Segments).[arrLength-1]
            let lengthLastExtension = Seq.length lastExtension
            if (lengthLastExtension <= 3) then
                not( lastExtension.Contains(".") )
            else
                //test for the 2 case
                let last4 = lastExtension.[(lengthLastExtension-1)-3..(lengthLastExtension-1)]
                let isValid2 = valid2|>Seq.exists(fun validEnd -> last4.EndsWith( validEnd) )
                if isValid2 then
                    true
                else
                    if lengthLastExtension <= 4 then
                        not( last4.Contains(".") )
                    else
                        let last5 = lastExtension.[(lengthLastExtension-1)-4..(lengthLastExtension-1)]
                        let isValid3 = valid3|>Seq.exists(fun validEnd -> last5.EndsWith( validEnd) )
                        if isValid3 then
                            true
                        else
                            if lengthLastExtension <= 5 then
                                not( last5.Contains(".") )
                            else
                                let last6 = lastExtension.[(lengthLastExtension-1)-5..(lengthLastExtension-1)]
                                let isValid4 = valid4|>Seq.exists(fun validEnd -> last6.EndsWith( validEnd) )
                                if isValid4 then
                                    true
                                else
                                    not( last6.Contains(".") ) && not(lastExtension.[0..5] = "mailto")

        //Create the correct links / -> add the homepage, make them a comparable Uri
        let hrefLinksToUri ( uri:ComparableUri ) (hrefLinks:string list) =
            hrefLinks
            |> List.map( fun link ->
                try
                    if Seq.length link <4 then
                        Some(new Uri( uri, link ))
                    else
                        if link.[0..3] = "http" then
                            Some(new Uri(link))
                        else
                            Some(new Uri( uri, link ))
                with
                | _ as ex ->
                    Console.WriteLine(link)
                    lock error (fun () ->error<-error.Add uri)
                    None )
            |> List.filter( fun link -> link.IsSome )
            |> List.map( fun o -> o.Value)
            |> List.map( fun uri -> new ComparableUri( string uri ) )

        //Treat uri, removing the ajax last part, and only following links specified by Benoit
        let linksToFollow (hrefUris:ComparableUri list) =
            hrefUris
            |>List.map( treatAjax )
            |>List.filter( fun link -> followHref link )
            |>List.map( fun uri -> new ComparableUri( string uri ) )
            |>Set.ofList

        let needToVisit uri =
            ( lock visited (fun () -> not( visited.Contains uri) ) ) && (lock error (fun () -> not( error.Contains uri) ))

        let getLinksToFollowAsyncDelay (delay:int) ( uri: ComparableUri ) =
            async{
                let! links = getHtmlPrimitiveAsyncDelay delay uri
                lock visited (fun () ->visited<-visited.Add uri)
                let linksToFollow =
                    getHref links
                    |> hrefLinksToUri uri
                    |> linksToFollow
                    |> Set.filter( needToVisit )
                    |> Set.map( fun link -> if uri.Authority=link.Authority then link else link )
                return linksToFollow
            }

        //Add delays if visiting the same authority
        let getDelay(uri:ComparableUri) (authorityDelay:Dictionary<string,int>) =
            let uriAuthority = uri.Authority
            let hasAuthority,delay = authorityDelay.TryGetValue(uriAuthority)
            if hasAuthority then
                authorityDelay.[uriAuthority] <-delay+1
                delay
            else
                authorityDelay.Add(uriAuthority,1)
                0

        let rec getLinksToFollowFromSetAsync maxIteration ( uris: seq<ComparableUri> ) =
            let authorityDelay = Dictionary<string,int>()
            if maxIteration = 100 then
                Console.WriteLine("Finished")
            else
                //Group by authority: add a delay for those with the same authority, ignore the others
                let stopwatch= System.Diagnostics.Stopwatch()
                stopwatch.Start()
                let newLinks =
                    uris
                    |> Seq.map( fun uri ->
                        let delay = lock authorityDelay (fun () -> getDelay uri authorityDelay )
                        getLinksToFollowAsyncDelay delay uri )
                    |> Async.Parallel
                    |> Async.RunSynchronously
                    |> Seq.concat
                stopwatch.Stop()
                Console.WriteLine("\n\n\n\n\n\n\nTimeElapse : "+string stopwatch.Elapsed+"\n\n\n\n\n\n\n\n\n")
                getLinksToFollowFromSetAsync (maxIteration+1) newLinks

        getLinksToFollowFromSetAsync 0 (seq[ComparableUri( "http://twitter.com/" )])
        Console.WriteLine("Finished")

    Some feedback would be great! Thank you (note this is just something I am doing for fun).
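    For readers who want the scheduling idea in isolation, the per-authority delay that getDelay computes in the F# listing can be sketched in a few lines of Python. This is an illustration only, not the author's code: assign_delays and step_seconds are hypothetical names, and the 0.25 s step simply mirrors the delay * 250 ms used in the listing.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def assign_delays(urls, step_seconds=0.25):
    """Pair each URL with a start delay that grows per host.

    The n-th URL seen for a given host (counting from 0) is delayed by
    n * step_seconds, so requests to one host are spread out while
    requests to different hosts can start immediately.
    """
    seen_per_host = defaultdict(int)  # host -> number of URLs already scheduled
    schedule = []
    for url in urls:
        host = urlsplit(url).netloc
        schedule.append((url, seen_per_host[host] * step_seconds))
        seen_per_host[host] += 1
    return schedule


schedule = assign_delays([
    "http://a.example/1",
    "http://b.example/1",
    "http://a.example/2",
])
```

    With a schedule like this, the total time of one crawl round is bounded below by the busiest host: a round with 400 links to a single host would take at least 100 s at this step size, however many requests run in parallel.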

    Read the article

  • WebCrawling Dynamic Links

    - by Jojo
    Hi everyone, does anybody have any idea how to crawl websites that have dynamic pages/queries? I mean that if I click a certain link, it has different values every time I try to reload it in a web browser. Right now my webcrawler cannot download the contents of these pages. Please advise.

    Read the article

  • Crawling engine architecture - Java/ Perl integration

    - by Bigtwinz
    Hi all, I am looking to develop a management and administration solution around our webcrawling Perl scripts. Right now our scripts are saved in SVN and are manually kicked off by sysadmins, devs, etc. Every time we need to retrieve data from new sources, we have to create a ticket with business instructions and goals. As you can imagine, this is not an optimal solution.

    There are 3 consistent themes with this system:

      - the retrieval of data has a "conceptual structure", for lack of a better phrase, i.e. the retrieval of information follows a particular path
      - we are only looking for very specific information, so we don't really have to worry about extensive crawling for a while (think thousands to tens of thousands of pages vs. millions)
      - crawls are URL-based instead of site-based

    As I enhance this alpha version to a more production-level beta, I am looking to add automation and management of the retrieval of data. Additionally, our other systems are Java (which I'm more proficient in), and I'd like to compartmentalize the Perl aspects so we don't have to lean heavily on outside help. I've evaluated the usual suspects (Nutch, Droid, etc.), but the time spent on modifying those frameworks to suit our specific information retrieval can't be justified.

    So I'd like your thoughts regarding the following architecture. I want to create a solution which:

      - uses Java as the interface for managing and executing the Perl scripts
      - uses Java for configuration and data access
      - sticks with Perl for retrieval

    An example use case would be:

      - a data analyst delivers us a requirement for crawling
      - a Perl developer creates the required script and uses this webapp to submit the script (which gets saved to the filesystem)
      - the script gets kicked off from the webapp with specific parameters
      - ....

    The webapp should be able to create multiple threads of the Perl script to initiate multiple crawlers.

    So the questions are:

      - what do you think?
      - how solid is the integration between Java and Perl, specifically when calling Perl from Java?
      - has someone used such a system, which is actually part Perl repository?

    The goal really is to not have a whole bunch of unorganized Perl scripts, and to put some management and organization on our information retrieval. Also, I know I can use Perl to do the web part of what we want, but as I mentioned before, I am trying to keep Perl focused. But if that seems ass-backwards, I'm not averse to making it an all-Perl solution. Open to any and all suggestions and opinions. Thanks

    Read the article

  • Building an automatic web crawler

    - by Sakin
    I am building a web application crawler that's meant not only to find all the links or pages in a web application, but also to perform all the allowed actions in the app (such as pushing buttons, filling in forms, noticing changes in the DOM even if they did not trigger a request, etc.). Basically, this is a kind of "browser simulator". I find WebKit a good option for implementing my crawler, since it has all the needed technology (JavaScript engine, parsers, DOM manipulation, etc.), but it seems like overkill, being a fully featured browser. Is there any toolkit you know of that can provide the above functionality?

    Read the article

  • Getting web page after calling DownloadStringAsync()?

    - by OverTheRainbow
    Hello, I don't know enough about VB.Net yet to use the richer HttpWebRequest class, so I figured I'd use the simpler WebClient class to download web pages asynchronously (to avoid freezing the UI). However, how can the asynchronous event handler actually return the web page to the calling routine?

        Imports System.Net

        Public Class Form1
            Private Shared Sub DownloadStringCallback2(ByVal sender As Object, ByVal e As DownloadStringCompletedEventArgs)
                If e.Cancelled = False AndAlso e.Error Is Nothing Then
                    Dim textString As String = CStr(e.Result)
                    'HERE : How to return textString to the calling routine?
                End If
            End Sub

            Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
                Dim client As WebClient = New WebClient()
                AddHandler client.DownloadStringCompleted, AddressOf DownloadStringCallback2
                Dim uri As Uri = New Uri("http://www.google.com")
                client.DownloadStringAsync(uri)
                'HERE : how to get web page back from callback function?
            End Sub
        End Class

    Thank you.

    Edit: I added a global, shared variable and a While/DoEvents/End While loop, but there's got to be a cleaner way to do this :-/

        Public Class Form1
            Shared page As String

            Public Shared Sub AlertStringDownloaded(ByVal sender As Object, ByVal e As DownloadStringCompletedEventArgs)
                ' If the string request went as planned and wasn't cancelled:
                If e.Cancelled = False AndAlso e.Error Is Nothing Then
                    page = CStr(e.Result)
                End If
            End Sub

            Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
                Dim wc As New WebClient
                AddHandler wc.DownloadStringCompleted, AddressOf AlertStringDownloaded
                page = Nothing
                wc.DownloadStringAsync(New Uri("http://www.google.com"))
                'Better way to wait until page has been filled?
                While page Is Nothing
                    Application.DoEvents()
                End While
                RichTextBox1.Text = page
            End Sub
        End Class

    Read the article

  • What is a good Java crawler library?

    - by DrDee
    Hi, I am about to develop a crawler in Java but don't feel like reinventing the wheel. A quick Google search gives a whole bunch of Java libraries for building a web crawler. Besides those, Nutch is of course a very robust package, but it seems a bit too advanced for my needs. I only need to crawl a handful of websites a week, containing a couple of thousand pages each. Which open source Java library would you recommend, considering:

      - speed
      - multithreading (or even distributed operation)
      - extending it with new functionality
      - active maintenance and documentation?

    Read the article

  • Mining Groups of people from Wikipedia

    - by AlgoMan
    I am trying to get the list of people from http://en.wikipedia.org/wiki/Category:People_by_occupation . I have to go through all the sections and get the people from each section. How should I go about it? Should I use a crawler to get the pages and search through those using BeautifulSoup? Or is there any other way to get the same data from Wikipedia?
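    One alternative to scraping the rendered pages is the MediaWiki web API, which can list the members of a category directly. The sketch below only builds the request URL, so it runs without network access. category_members_url is a hypothetical helper name, and the parameter choices (JSON output, 50 results per request) are assumptions for illustration rather than a recommendation.

```python
from urllib.parse import urlencode

# Public API endpoint of the English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def category_members_url(category, limit=50, continue_token=None):
    """Build a MediaWiki API URL that lists the members of one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,   # e.g. "Category:People_by_occupation"
        "cmlimit": limit,      # members returned per request
        "format": "json",
    }
    if continue_token:
        # Token taken from the previous response to fetch the next batch.
        params["cmcontinue"] = continue_token
    return API_ENDPOINT + "?" + urlencode(params)


url = category_members_url("Category:People_by_occupation")
```

    The members of a category can themselves be subcategories, so walking "all the sections" would mean repeating the same request for each subcategory that comes back.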

    Read the article

  • Prevent bot from crawling certain areas of site.

    - by Skoder
    Hey, I don't know much about SEO and how web spiders work, so forgive my ignorance here. I'm creating a site (using ASP.NET MVC) which has areas that display information retrieved from the database. The data is unique to the user, so there's no real server-side output caching going on. However, since the data can contain things the user may not wish to have displayed in search engine results, I'd like to prevent any spiders from accessing the search results page. Are there any special actions I should take to ensure that the search result directory isn't crawled? Also, would a spider even crawl a page that's dynamically generated, and would any actions preventing certain directories from being searched mess up my search engine rankings? Edit: I should add that I'm reading up on the robots.txt protocol, but it relies on co-operation from the web crawler. However, I'd also like to stop any data-mining users who will ignore the robots.txt file. I appreciate any help!
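    To make the robots.txt part concrete, here is a minimal sketch of how a co-operating crawler might interpret a rule that blocks a search results directory, using Python's standard urllib.robotparser module. The /search/ path, the ExampleBot user agent and the example.com URLs are made up for illustration. As the question already notes, this only affects crawlers that choose to honour the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that keeps all crawlers out of /search/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
blocked = parser.can_fetch("ExampleBot", "http://example.com/search/results?q=x")
allowed = parser.can_fetch("ExampleBot", "http://example.com/about")
print(blocked, allowed)
```

    Keeping out clients that ignore the file would need server-side measures instead, such as requiring a login for the user-specific pages.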

    Read the article

  • Asynchronous crawling F#

    - by jlezard
    When crawling webpages I need to be careful not to make too many requests to the same domain; for example, I want to put 1 s between requests. From what I understand, it is the time between requests that is important. So to speed things up I want to use async workflows in F#, the idea being to make the requests at 1 s intervals but avoid blocking things while waiting for each response.

        let getHtmlPrimitiveAsyncTimer (uri : System.Uri) (timer:int) =
            async{
                let req = (WebRequest.Create(uri)) :?> HttpWebRequest
                req.UserAgent<-"Mozilla"
                try
                    Thread.Sleep(timer)
                    let! resp = (req.AsyncGetResponse())
                    Console.WriteLine(uri.AbsoluteUri+" got response")
                    use stream = resp.GetResponseStream()
                    use reader = new StreamReader(stream)
                    let html = reader.ReadToEnd()
                    return html
                with
                | _ as ex -> return "Bad Link"
            }

    Then I do something like:

        let uri1 = System.Uri "http://rue89.com"
        let timer = 1000
        let jobs = [|for i in 1..10 -> getHtmlPrimitiveAsyncTimer uri1 timer|]

        jobs
        |> Array.mapi(fun i job ->
            Console.WriteLine("Starting job "+string i)
            Async.StartAsTask(job).Result)

    Is this alright? I am very unsure about 2 things:

      - Does the Thread.Sleep thing work for delaying the request?
      - Is using StartAsTask a problem?

    I am a beginner (as you may have noticed) in F# (and in coding in general, actually), and everything involving threads scares me :) Thanks!!
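    The idea described here, spacing out the start of each request without blocking while responses are pending, can be sketched with Python's asyncio. This is an analogy, not a translation of the F# code. fetch_stub is a hypothetical stand-in that sleeps instead of making an HTTP request, and the short interval is chosen only so the example finishes quickly.

```python
import asyncio
import time


async def fetch_stub(url):
    """Stand-in for a real HTTP request (assumption: no network is used here)."""
    await asyncio.sleep(0.05)  # pretend the server takes 50 ms to answer
    return "<html>" + url + "</html>"


async def crawl_spaced(urls, interval):
    """Start one request every `interval` seconds without blocking a thread."""

    async def job(index, url):
        # Non-blocking wait: other jobs keep running while this one sleeps.
        await asyncio.sleep(index * interval)
        started = time.monotonic()
        body = await fetch_stub(url)
        return started, body

    # gather() runs the jobs concurrently and returns results in input order.
    return await asyncio.gather(*(job(i, u) for i, u in enumerate(urls)))


results = asyncio.run(crawl_spaced(
    ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
    interval=0.1,
))
```

    The design point is that the wait is awaited rather than executed as a blocking call, so a slow response to the first request does not delay the start of the second.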

    Read the article

  • Problem extracting text from RSS feeds

    - by Gautam
    Hi, I am new to the world of Ruby and Rails. I have seen Railscast 190 and I just started playing with it. I used SelectorGadget to find out the CSS and XPath. I have the following code:

        require 'rubygems'
        require 'nokogiri'
        require 'open-uri'

        url = "http://www.telegraph.co.uk/sport/football/rss"
        doc = Nokogiri::HTML(open(url))
        doc.xpath('//a').each do |paragraph|
          puts paragraph.text
        end

    When I extracted text from a normal HTML page with CSS, I could get the extracted text on the console. But when I try to do the same, with either CSS or XPath, for the RSS feed at the URL in the code above, I don't get any output. How do you extract text from RSS feeds?

    I also have another silly question. Is there a way to extract text from 2 different feeds and display it on the console, something like:

        url1 = "http://www.telegraph.co.uk/sport/football/rss"
        url2 = "http://www.telegraph.co.uk/sport/cricket/rss"

    Looking forward to your help and suggestions. Thank you, Gautam
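    A likely reason for the empty output is that an RSS feed is XML made of item, title and link elements rather than HTML anchors, so a query for a elements finds nothing. The sketch below shows the idea in Python with the standard xml.etree.ElementTree module. SAMPLE_FEED is a made-up feed used in place of the Telegraph URLs so that the example runs offline, and item_titles is a hypothetical helper name. In Nokogiri the analogous change would presumably be to parse with Nokogiri::XML and query for the item titles.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 document standing in for a downloaded feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First story</title><link>http://example.com/1</link></item>
    <item><title>Second story</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""


def item_titles(feed_xml):
    """Return the title text of every <item> in an RSS document."""
    root = ET.fromstring(feed_xml)
    return [item.findtext("title") for item in root.iter("item")]


# Several feeds can be handled by looping over their contents.
for feed in [SAMPLE_FEED, SAMPLE_FEED]:
    for title in item_titles(feed):
        print(title)
```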

    Read the article

  • Can EC2 instances be set up to come from different IP ranges?

    - by Joshua Frank
    I need to run a web crawler and I want to do it from EC2 because I want the HTTP requests to come from different IP ranges so I don't get blocked. So I thought distributing this on EC2 instances might help, but I can't find any information about what the outbound IP range will be. I don't want to go to the trouble of figuring out the extra complexity of EC2 and distributed data, only to find that all the instances use the same address block and I get blocked by the server anyway. NOTE: This isn't for a DoS attack or anything. I'm trying to harvest data for a legitimate business purpose, I'm respecting robots.txt, and I'm only making one request per second, but the host is still shutting me down. Edit: Commenter Paul Dixon suggests that the act of blocking even my modest crawl indicates that the host doesn't want me to crawl them and therefore that I shouldn't do it (even assuming I can work around the blocking). Do people agree with this?

    Read the article

  • Web request returns "DOS"

    - by jlezard
    I am getting "DOS" instead of the html string:

        let getHtmlBasic (uri :System.Uri ) =
            use client = new WebClient()
            client.DownloadString( uri)

        let uri = System.Uri( "http://www.b-a-r-f.com/" )
        getHtmlBasic uri

    This gives the string "DOS". Lol, what the? All other websites seem to work...

    Read the article

  • What is a good Java web crawler library?

    - by DrDee
    Hi, I am about to develop a crawler in Java but don't feel like reinventing the wheel. A quick Google search gives a whole bunch of Java libraries for building a web crawler. Besides those, Nutch is of course a very robust package, but it seems a bit too advanced for my needs. I only need to crawl a handful of websites a week, containing a couple of thousand pages each. Which open source Java library would you recommend, considering:

      - speed
      - multithreading (or even distributed operation)
      - extending it with new functionality
      - active maintenance and documentation?

    Read the article

  • What's a good Web Crawler tool

    - by Glenn Slaven
    I need to index a whole lot of webpages. What good webcrawler utilities are there? I'm preferably after something that .NET can talk to, but that's not a showstopper. What I really need is something that I can give a site URL to, and it will follow every link and store the content for indexing.

    Read the article

  • Does Flickr "Know" if a Hotlinked Image Does Not Link Back to Source?

    - by Michael Robinson
    From Flickr's community guidelines: "Do link back to Flickr when you post your photos elsewhere. The Flickr service makes it possible to post images hosted on Flickr to outside web sites. However, pages on other web sites that display images hosted on flickr.com must provide a link from each photo back to its photo page on Flickr." Our company currently allows image hotlinking for user-uploaded images. It turns out that this has been more popular than we had expected, and I would like to capitalize on this if possible. We will be altering the guidelines to include a clause similar to Flickr's, quoted above. As hotlinking costs us, both in terms of server load and bandwidth, we would like to get at least something out of it, other than merely a warm feeling inside. My question: Does Flickr "know" if a hotlinked image does not link back to its source? Bonus: if Flickr knows, how?

    Read the article

  • Guidelines for good webcrawler 'Etiquette'

    - by Harry
    I'm building a search engine (for fun), and it has just struck me that my little project might wreak havoc by clicking on ads and causing all sorts of problems. So what are the guidelines for good webcrawler 'etiquette'? Things that spring to mind:

      - Observe robots.txt instructions
      - Limit the number of simultaneous requests to the same domain
      - Don't follow ad links?

    Stopping the crawler from clicking on ads is particularly on my mind at the moment. How do I stop my bot from 'clicking' on ads? If it goes straight to the URL in the ad, is that counted as a click?

    Read the article

  • need help in site classification

    - by goh
    Hi guys, I have to crawl the contents of several blogs. The problem is that I need to classify whether the blogs' authors are from a specific school and are talking about the school's stuff. What's the best approach to doing the crawling, and how should I go about the classification?

    Read the article

  • How do travel search engines & aggregators get their source data?

    - by Alex
    I'm tossing around a few ideas for travel search engines and I'm wondering how these sites get their source data. Do they scrape all the content from airline homepages? This seems like an enormous job given the number of airlines etc. out there. Is there some API or web service standard that every airline conforms to? Am I going to have to negotiate access to the data APIs for every airline I wish to index?

    Read the article

  • how to store data crawled from website

    - by Richard
    I want to crawl a website and store the content on my computer for later analysis. However, my OS file system has a limit on the number of subdirectories, meaning that storing the original folder structure is not going to work. Suggestions? Map each URL to some filename so I can store the files flat? Or just shove everything into a database like SQLite to avoid file system limitations?
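    Both ideas in the question are small enough to sketch. The Python example below shows one possible shape for each: a hash of the URL as a flat filename, and a single SQLite table keyed by URL. The function names, the SHA-256 choice and the table layout are assumptions for illustration, and the in-memory database is used only so the example leaves no files behind.

```python
import hashlib
import sqlite3


def url_to_filename(url):
    """Map a URL to a fixed-length flat filename (no subdirectories needed)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"


def open_store(path=":memory:"):
    """Open (or create) a SQLite database with one table of crawled pages."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body BLOB)"
    )
    return conn


def save_page(conn, url, body):
    """Insert a page, replacing any earlier copy stored under the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)", (url, body)
    )
    conn.commit()


def load_page(conn, url):
    """Return the stored body for a URL, or None if it was never saved."""
    row = conn.execute(
        "SELECT body FROM pages WHERE url = ?", (url,)
    ).fetchone()
    return row[0] if row else None


conn = open_store()
save_page(conn, "http://example.com/a/b/c", b"<html>hello</html>")
```

    The hashed filename loses the original URL, so the flat-file variant would still need a separate index from filename back to URL. The database variant keeps the two together.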

    Read the article

  • guide on crawling the entire web ?

    - by bohohasdhfasdf
    I just had this thought, and was wondering if it's possible to crawl the entire web (just like the big boys!) on a single dedicated server (say Core2Duo, 8 GB RAM, 750 GB disk, 100 Mbps). I've come across a paper where this was done, but I cannot recall its title. It was about crawling the entire web on a single dedicated server using some statistical model. Anyway, imagine starting with just around 10,000 seed URLs and doing an exhaustive crawl. Is it possible? I need to crawl the web but am limited to one dedicated server. How can I do this? Is there an open source solution out there already? For example, see this real-time search engine: http://crawlrapidshare.com. The results are extremely good and freshly updated. How are they doing this?

    Read the article

  • Not crawling the same content twice

    - by sirrocco
    I'm building a small application that will crawl sites where the content is growing (like on Stack Overflow); the difference is that the content, once created, is rarely modified. In the first pass I crawl all the pages in the site. But after that I don't want to re-crawl all of the site's paged content, just the latest additions. So if the site had 500 pages, and on the second pass it has 501 pages, then I would only crawl the first and second pages. Would this be a good way to handle the situation? In the end, the crawled content will end up in Lucene, creating a custom search engine, so I would like to avoid crawling the same content multiple times. Any better ideas? EDIT: Let's say the site has a page, Results, that is accessed like so: Results?page=1, Results?page=2, etc. I guess that keeping track of how many pages there were at the last crawl and just crawling the difference would be enough (maybe using a hash of each result on the page: if I start running into the same hashes, I should stop).
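    The hash-based stopping rule from the EDIT can be sketched as follows. This is one possible reading of it, assuming pages are visited newest first and that each result can be reduced to a string. crawl_new_results and digest are hypothetical names, and the page contents are passed in as lists rather than fetched over HTTP.

```python
import hashlib


def digest(result):
    """Stable fingerprint of one result's text."""
    return hashlib.sha256(result.encode("utf-8")).hexdigest()


def crawl_new_results(pages, seen_hashes):
    """Collect unseen results, stopping at the first page with known content.

    `pages` yields lists of result strings, newest page first.
    `seen_hashes` is a set of digests from earlier crawls; it is updated
    in place so the next crawl can reuse it.
    """
    new_items = []
    for results in pages:
        fresh = [r for r in results if digest(r) not in seen_hashes]
        seen_hashes.update(digest(r) for r in fresh)
        new_items.extend(fresh)
        if len(fresh) < len(results):
            break  # reached content from an earlier crawl, so stop here
    return new_items


seen = {digest("c"), digest("d")}
found = crawl_new_results([["a", "b"], ["b2", "c"], ["d", "zzz"]], seen)
```

    Compared with counting pages, this does not depend on the page size staying constant, but it does assume that new results only ever appear at the front.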

    Read the article

  • Approaches for cross server content sharing?

    - by Anonymity
    I've been tasked with finding the best solution for serving up content on our new site from another one of our sites. Several approaches have been suggested to me. One that I've looked into is using SharePoint's Lists Web Service to grab the list through JavaScript, which results in XSS and is not an option. Another suggestion was to build a custom server-side web service and use SharePoint Request Forms to get the information; this is something I've only looked at very briefly. It's also been suggested that I try permitting the requesting site in the HTTP headers of the serving site, since I have access to both. This ultimately resulted in a semi-working solution that had major security holes (I had to include the username/password in the request to appease AD authentication). This was done by allowing Access-Control-Allow-Origin: *. The most direct approach I could think of was simply to build the webpart into our new environment and have the authors manually update this content the same way they would on the other site. Is any one of the suggestions here more valid than another? Which would be the best approach? Are there other suggestions I may be overlooking? I'm also not sure if web crawling or content scraping really holds water here...

    Read the article
