crawler - Page 5 - Developer IT

How to crawl a file hosting website with Scrapy in Python?

- by Veryel Hua

Can anyone help me figure out how to crawl a file hosting website like filefactory.com? I don't want to download all the hosted files, just index all available files with Scrapy. I have read the tutorial and the docs on Scrapy's spider class. If I give only the website's main page as the start URL, I won't crawl the whole site, because crawling depends on links and the start page does not seem to point to any file pages. That's the problem as I see it, and any help would be appreciated!

Most efficient way for testing links

- by Burnzy

I'm currently developing an app that goes through all the files on a server and checks every href to see whether it is valid. Using a WebClient or an HttpWebRequest/HttpWebResponse is overkill because it downloads the whole page each time, which is useless: I only need to check that the link does not return 404. What would be the most efficient way? A socket seems to be a good way of doing it, but I'm not quite sure how that works. Thanks for sharing your expertise!
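
One way to avoid downloading the page body is to send an HTTP HEAD request and read only the status line. The sketch below is in Python rather than .NET and uses only the standard library; the function names `head_status` and `is_broken` are made up for illustration. It assumes the target server answers HEAD requests, which not every server does, and it does not follow redirects.

```python
import http.client
from urllib.parse import urlsplit


def head_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url using HEAD, so no body is transferred."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    # A port of None makes http.client fall back to the scheme's default port.
    conn = conn_cls(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        # Redirects are not followed: a 301/302 is returned as-is.
        return conn.getresponse().status
    finally:
        conn.close()


def is_broken(url: str) -> bool:
    """Treat only 404 as broken, matching the check described in the question."""
    return head_status(url) == 404
```

If a server rejects HEAD, a fallback would be a GET that is closed as soon as the status line has been read.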

Robots.txt with one site but two domains

- by Dofs

I have a website which has two domains added. Both domains point to the root of the website. Is it possible to alter the robots.txt so that one of the domains doesn't get crawled, while the other still does?
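
Because both domains serve the same document root, a single static robots.txt will be identical on both. One common workaround is to generate the response from the request's Host header. The sketch below shows only that selection logic in Python; the domain name and the function name `robots_txt_for` are hypothetical, and wiring it into a particular web server or framework is left out.

```python
# Hypothetical host name, used only for illustration.
BLOCKED_HOSTS = {"www.example-secondary.com"}

ALLOW_ALL = "User-agent: *\nDisallow:\n"
DISALLOW_ALL = "User-agent: *\nDisallow: /\n"


def robots_txt_for(host_header: str) -> str:
    """Pick the robots.txt body based on the Host header of the request."""
    # Strip an optional ":port" suffix and normalise case before comparing.
    host = host_header.split(":", 1)[0].strip().lower()
    return DISALLOW_ALL if host in BLOCKED_HOSTS else ALLOW_ALL
```

The same effect can usually be achieved with a rewrite rule that maps /robots.txt to a different file depending on the host, if the web server supports it.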

Retrieving coordinates in this page

- by hao

Hey guys, I'm trying to do some data mining and analyze data based on locations. For this site, http://www.dianping.com/shop/1898365, I am trying to find the latitude and longitude by crawling, but I can't figure out where this information is stored. Can someone give me some pointers?

Use jQuery on a variable instead of the DOM? [solved]

- by Stef

In jQuery you can do:

    $("a[href$='.img']").each(function(index) {
        alert($(this).attr('href'));
    });

I want to write a jQuery function which crawls x levels of a website and collects all hrefs to gif images. So when I use the get function to retrieve another page,

    $.get(href, function(data) { });

I want to be able to do something like

    data.$("a[href$='.img']").each(function(index) { });

Is this possible?

...UPDATE... Thanks to the answers, I was able to fix this problem.

    function FetchPage(href) {
        $.ajax({
            url: href,
            async: false,
            cache: false,
            success: function(html) {
                $("#__tmp__").append("<page><name>" + href + "</name><content>" + html + "</content></page>");
            }
        });
    }

See this zip file for an example of how to use it.

How should I interpret site analytics with 11 pageviews in a 3 second visit?

- by Juank

I'm using Google Analytics and recently I've noticed some weird trends. I have a lot of visits that last mere seconds but record several page views, more than a normal human can view in that time. A specific case: the only visitor from Ireland I've had so far recorded 11 pageviews in a 3 second visit. Are these crawlers? Shouldn't Google Analytics filter those out?

Need to crawl images and their whole web pages

- by Kei Situ

Hey, I am starting a project and wonder about the relationship between the characters in images and the whole web page where the images reside. So first, I want to crawl some images and their web pages, and I need to save the crawl results to local disk for further analysis. Is there any open source tool for this? Thanks ^_^

Solr crawler outlined packaged as a solution or product?

- by Doug

Dominique, Is the Solr crawler you outlined packaged as a solution or product? I'm looking for something similar to build a vertical search engine. http://stackoverflow.com/questions/282654/recommendations-for-a-spidering-tool-to-use-with-lucene-or-solr

Ruby. Mongoid. Relations

- by Scepion1d

I've encountered some problems with Mongoid. I have three models:

    require 'mongoid'

    class Configuration
      include Mongoid::Document
      belongs_to :user
      field :links, :type => Array
      field :root, :type => String
      field :objects, :type => Array
      field :categories, :type => Array
      has_many :entries
    end

    class TimeDim
      include Mongoid::Document
      field :day, :type => Integer
      field :month, :type => Integer
      field :year, :type => Integer
      field :day_of_week, :type => Integer
      field :minute, :type => Integer
      field :hour, :type => Integer
      has_many :entries
    end

    class Entry
      include Mongoid::Document
      belongs_to :configuration
      belongs_to :time_dim
      field :category, :type => String
      # any other dynamic fields
    end

Creating documents for Configuration and TimeDim succeeds. But when I try to execute the following code:

    params = Hash.new
    params[:configuration] = config # an instance of Configuration from DB
    entry.each do |key, value|
      params[key.to_sym] = value # String
    end
    unless Entry.exists?(conditions: params)
      params[:time_dim] = self.generate_time_dim # an instance of TimeDim from DB
      params[:category] = self.detect_category(descr) # String
      Entry.new(params).save
    end

... I see the following output:

    /home/scepion1d/Workspace/RubyMine/dana-x/.bundle/ruby/1.9.1/gems/bson-1.6.1/lib/bson/bson_c.rb:24:in `serialize': Cannot serialize an object of class Configuration into BSON. (BSON::InvalidDocument)
        from /home/scepion1d/Workspace/RubyMine/dana-x/.bundle/ruby/1.9.1/gems/bson-1.6.1/lib/bson/bson_c.rb:24:in `serialize'
        from /home/scepion1d/Workspace/RubyMine/dana-x/.bundle/ruby/1.9.1/gems/mongo-1.6.1/lib/mongo/cursor.rb:604:in `construct_query_message'
        from /home/scepion1d/Workspace/RubyMine/dana-x/.bundle/ruby/1.9.1/gems/mongo-1.6.1/lib/mongo/cursor.rb:465:in `send_initial_query'
        from /home/scepion1d/Workspace/RubyMine/dana-x/.bundle/ruby/1.9.1/gems/mongo-1.6.1/lib/mongo/cursor.rb:458:in `refresh'
        from /home/scepion1d/Workspace/RubyMine/dana-x/.bundle/ruby/1.9.1/gems/mongo-1.6.1/lib/mongo/cursor.rb:128:in `next'
        from /home/scepion1d/Workspace/RubyMine/dana-x/.bundle/ruby/1.9.1/gems/mongo-1.6.1/lib/mongo/db.rb:509:in `command'
        from /home/scepion1d/Workspace/RubyMine/dana-x/.bundle/ruby/1.9.1/gems/mongo-1.6.1/lib/mongo/cursor.rb:191:in `count'
        from /home/scepion1d/Workspace/RubyMine/dana-x/.bundle/ruby/1.9.1/gems/mongoid-2.4.6/lib/mongoid/cursor.rb:42:in `block in count'
        from /home/scepion1d/Workspace/RubyMine/dana-x/.bundle/ruby/1.9.1/gems/mongoid-2.4.6/lib/mongoid/collections/retry.rb:29:in `retry_on_connection_failure'
        from /home/scepion1d/Workspace/RubyMine/dana-x/.bundle/ruby/1.9.1/gems/mongoid-2.4.6/lib/mongoid/cursor.rb:41:in `count'
        from /home/scepion1d/Workspace/RubyMine/dana-x/.bundle/ruby/1.9.1/gems/mongoid-2.4.6/lib/mongoid/contexts/mongo.rb:93:in `count'
        from /home/scepion1d/Workspace/RubyMine/dana-x/.bundle/ruby/1.9.1/gems/mongoid-2.4.6/lib/mongoid/criteria.rb:45:in `count'
        from /home/scepion1d/Workspace/RubyMine/dana-x/.bundle/ruby/1.9.1/gems/mongoid-2.4.6/lib/mongoid/finders.rb:60:in `exists?'
        from /home/scepion1d/Workspace/RubyMine/dana-x/crawler/crawler.rb:110:in `block (2 levels) in push_entries_to_db'
        from /home/scepion1d/Workspace/RubyMine/dana-x/crawler/crawler.rb:103:in `each'
        from /home/scepion1d/Workspace/RubyMine/dana-x/crawler/crawler.rb:103:in `block in push_entries_to_db'
        from /home/scepion1d/Workspace/RubyMine/dana-x/crawler/crawler.rb:102:in `each'
        from /home/scepion1d/Workspace/RubyMine/dana-x/crawler/crawler.rb:102:in `push_entries_to_db'
        from main_starter.rb:15:in `<main>'

Can anyone tell me what I am doing wrong?

How to make a jar file run on startup and keep running when you log out?

- by RanZilber

I have no idea where to start looking. I've been reading about daemons and didn't understand the concept. More details: I've written a crawler that never stops and crawls RSS feeds on the internet. The crawler is written in Java, so it's a jar right now. I'm an administrator on a machine that runs Ubuntu 11.04. There is some chance the machine will crash, so I'd like the crawler to start every time the machine starts up. Furthermore, I'd like it to keep running even when I'm logged out. I'm not sure this is possible, but most of the time I'm logged out, and I still want it to crawl. Any ideas? Can someone point me in the right direction? Just looking for the simplest solution.

How does a spider in a search engine work?

- by Niraj CHoubey

How does a crawler or spider in a search engine work?
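
As a rough illustration of the general idea, a crawler keeps a queue of URLs still to visit (the frontier) and a set of URLs it has already seen, and repeatedly fetches a page, records it, and enqueues the links it has not met before. The Python sketch below runs over a toy in-memory "web" instead of real HTTP; `crawl`, `fetch_links` and `TOY_WEB` are names invented for this example, and real spiders add politeness delays, robots.txt handling and indexing on top of this loop.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(seed: str,
          fetch_links: Callable[[str], Iterable[str]],
          max_pages: int = 100) -> List[str]:
    """Breadth-first crawl starting at seed; returns pages in visit order."""
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # URLs already queued, to avoid loops
    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)    # a real spider would index the page here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Toy link graph: hypothetical page names mapped to their outgoing links.
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
order = crawl("a", lambda url: TOY_WEB.get(url, []))
print(order)  # ['a', 'b', 'c', 'd']
```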

DTS to AC3 conversion for LG TV using mediatomb DLNA server

- by prion crawler

I want to convert an MKV video file containing DTS audio to a stream with AC3 audio. I want to pass the resulting stream to mediatomb's transcoding feature. Mediatomb will transfer the stream via DLNA to an LG TV, which does not support DTS audio. I have tried the VLC command below, but the TV does not recognize the stream, and playing the destination stream on a PC does not produce sound.

    vlc -vvv -I dummy INPUT.file --sout \
      '#transcode{acodec=ac3,ab=256k,channels=2,threads=4} \
      :std{mux=ts,access=file,dst=DEST.file}'

The following ffmpeg command gives a stream that plays on the TV with sound, but the ffmpeg process gets killed (with signal 15) within 10-15 seconds, and then the TV restarts playback from the beginning. This goes on in a loop.

    ffmpeg -i INPUT.file -acodec ac3 -ab 384k -vcodec copy \
      -vbsf h264_mp4toannexb -f mpegts -y DEST.file

I want a working DLNA server that transcodes DTS to AC3; any help is appreciated.

Google search engine

- by kourosh

I am working on a Google box, something like this: http://mytwentyfive.com/blog/wp-content/uploads/byme/Google%20Search%20Appliances.jpg I am pointing the crawler to a folder that contains HTML files. Previously the crawler crawled the files and indexed them, but right now it finds the pattern or the folder and does not follow any HTML files within the folder. I have tried everything I know and can't think of anything else. Can someone help? Thanks

Java Design Questions - Class, Function, Access Modifiers

- by Ron

I am a newbie to Java and have some design questions. Say I have a crawler application that does the following:

1. Crawls a URL and gets its content
2. Parses the contents
3. Displays the contents

How do you decide between implementing a function or a class?
-- Should the parser be a function of the crawler class, or should it be a class in itself, so it can be used by other applications as well?
-- If it should be a class, should it be a protected or a public class?

How do you decide between implementing a public or protected class?
-- If I had to create a class to generate stats from the parsed contents, for example, should that class be protected (so only the crawler class can access it) or should it be public?

Thanks, Ron

Simple Scala question about htmlparser

- by kula

Hi all. I'm a Scala newbie and I have one question. In my code I try to import the htmlparser library, compiling like this:

    scalac -classpath /home/kula/code/201005/kookle/lib/htmlparser.jar crawler.scala

and I run the code with:

    scala main

and it tells me:

    java.lang.NoClassDefFoundError: org/htmlparser/Parser
        at FetchActor$$anonfun$act$1$$anonfun$apply$1.apply(crawler.scala:21)
        at FetchActor$$anonfun$act$1$$anonfun$apply$1.apply(crawler.scala:13)
        at scala.actors.Reaction.run(Reaction.scala:78)
        at scala.actors.FJTask$Wrap.run(Unknown Source)
        at scala.actors.FJTaskRunner.scanWhileIdling(Unknown Source)
        at scala.actors.FJTaskRunner.run(Unknown Source)

I checked the file /home/kula/code/201005/kookle/lib/htmlparser.jar and there is no problem with it. Can anyone tell me what causes this bug?

Ruby: execute code in the inheriting class

- by AdamB

I'm trying to set up a global exception capture where I can add extra information when an error happens. I have two classes, "crawler" and "amazon". What I want to do is be able to call "crawl", execute a function in amazon, and use the exception handling in the crawl function. Here are the two classes I have:

    require 'mechanize'

    class Crawler
      Mechanize.html_parser = Nokogiri::HTML

      def initialize
        @agent = Mechanize.new
      end

      def crawl
        puts "crawling"
        begin
          #execute code in Amazon class here?
        rescue Exception => e
          puts "Exception: #{e.message}"
          puts "On url: #{@current_url}"
          puts e.backtrace
        end
      end

      def get(url)
        @current_url = url
        @agent.get(url)
      end
    end

    class Amazon < Crawler
      #some code with errors
      def stuff
        page = get("http://www.amazon.com")
        puts page.parser.xpath("//asldkfjasdlkj").first['href']
      end
    end

    a = Amazon.new
    a.crawl

Is there a way I can call "stuff" inside of "crawl" so I can use that exception handling over the entire stuff function? Is there a better way to accomplish this?

VB.Net HttpWebRequest is slow compared to Python urlopen

- by regexhacks

Hi, I am coding a web crawler that crawls websites and selectively parses different sections of a site. I am a .Net developer, so .Net was the obvious choice, but it was very slow, and that included both downloading and parsing the HTML pages. I then tried just downloading the contents, first with .Net and then for the same domains with Python, and Python was impressively fast at downloading the data. I have the downloading working in Python, but the later part is not that easy to code in Python, so I would rather not do it there. The same batch of domains that took 100 seconds in Python took 20 minutes in the .Net-based crawler. I tried downloading http://www.eqlit.com/: it took 8 seconds in Python and 100 seconds in the .Net crawler. Does anyone have any idea why this is slow in .Net but fast in Python?

Understanding the maximum hit-rate supported by a web-server

- by SNag

I would like to crawl a publicly available site (and one that's legal to crawl) for a personal project. From a brief trial of the crawler, I gathered that my program hits the server with a new HTTPRequest 8 times in a second. At this rate, as per my estimate, to obtain the full set of data I need about 60 full days of crawling. While the site is legal to crawl, I understand it can still be unethical to crawl at a rate that causes inconvenience to the regular traffic on the site. What I'd like to understand here is -- how high is 8 hits per second to the server I'm crawling? Could I possibly do 4 times that (by running 4 instances of my crawler in parallel) to bring the total effort down to just 15 days instead of 60? How do you find the maximum hit-rate a web-server supports? What would be the theoretical (and ethical) upper-limit for the crawl-rate so as to not adversely affect the server's routine traffic?
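
The figures in the question can be checked with a little arithmetic, and a crawl rate can be capped with a small throttle. The Python sketch below does both; `total_requests`, `days_needed` and `Throttle` are names invented for this example. It only confirms the 60-day and 15-day estimates and shows one way to enforce a chosen rate; it says nothing about which rate a particular server can tolerate.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def total_requests(rate_per_second: float, days: float) -> float:
    """Number of requests issued at a steady rate over the given number of days."""
    return rate_per_second * days * SECONDS_PER_DAY


def days_needed(requests: float, rate_per_second: float) -> float:
    """Days required to issue the given number of requests at a steady rate."""
    return requests / (rate_per_second * SECONDS_PER_DAY)


class Throttle:
    """Sleep as needed so that calls to wait() are at least 1/rate seconds apart."""

    def __init__(self, rate_per_second: float) -> None:
        self.min_interval = 1.0 / rate_per_second
        self.last_call = None

    def wait(self) -> None:
        now = time.monotonic()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                time.sleep(remaining)
        self.last_call = time.monotonic()


workload = total_requests(8, 60)   # 8 requests/s for 60 days
print(workload)                    # 41472000
print(days_needed(workload, 32))   # 15.0 (four parallel instances at 8/s each)
```

Calling `Throttle(8).wait()` before each request would hold a single crawler instance to at most 8 requests per second.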

Hosting a magnet link site which could possibly infringe copyrighted material?

- by Griff

Over the last 3 months I have built a crawler, an indexer and a lot of other things for what started out as a home project for indexing magnet links on the internet. As my project grew I have thought about releasing my collected data (which at the moment is on a public domain but with no access) to the public. Whatever the crawler sucks in goes in, and whatever the indexer decides to index gets indexed, as it is a fully automated process. My question is as follows: considering that most of the data collected by what I have built points to illegal copyrighted material (as most magnet links do), where would it be best to host such a site? I notice all of the already public torrent sites are hosted in India; is this because their laws are less strict on copyright infringement? Have any of you hosted such a site, and if so, what problems have you run into? And, as always, any advice on being a webmaster for this type of website?

Problem with crawling Oracle Portal with SharePoint Server 2007 Search

- by John Hansen

We get a "No Index Attribute" error when we try to index Oracle Portal with the SharePoint Server 2007 Search crawler. The content source is added successfully. The error messages appear in the crawler log.

Google search box

- by user343282

I am working on a Google box, something like this: http://mytwentyfive.com/blog/wp-content/uploads/byme/Google%20Search%20Appliances.jpg I am pointing the crawler to a folder that contains HTML files. Previously the crawler crawled the files and indexed them, but right now it finds the pattern or the folder and does not follow any HTML files within the folder. I have tried everything I know and can't think of anything else. Can someone help? Thanks

Is there a way to flush HTML to the wire in Sinatra?

- by thismatt

I have a Sinatra app with a long-running process (a web scraper). I'd like the app to flush the results of the crawler's progress while the crawler is running instead of at the end. I've considered forking the request and doing something fancy with Ajax, but this is a really basic one-page app that just needs to output a log to a browser as it's happening. Any suggestions?

Erlang OTP application design

- by Toby Hede

I am struggling a little to come to grips with the OTP development model as I convert some code into an OTP app. I am essentially making a web crawler and I just don't quite know where to put the code that does the actual work. I have a supervisor which starts my worker:

    -behaviour(supervisor).
    -define(CHILD(I, Type), {I, {I, start_link, []}, permanent, 5000, Type, [I]}).

    init(_Args) ->
        Children = [
            ?CHILD(crawler, worker)
        ],
        RestartStrategy = {one_for_one, 0, 1},
        {ok, {RestartStrategy, Children}}.

In this design, the crawler worker is then responsible for doing the actual work:

    -behaviour(gen_server).

    start_link() ->
        gen_server:start_link(?MODULE, [], []).

    init([]) ->
        inets:start(),
        httpc:set_options([{verbose_mode,true}]),
        % gen_server:cast(?MODULE, crawl),
        % ok = do_crawl(),
        {ok, #state{}}.

    do_crawl() ->
        % crawl!
        ok.

    handle_cast(crawl, State) ->
        ok = do_crawl(),
        {noreply, State};

do_crawl spawns a fairly large number of processes and requests that handle the work of crawling via HTTP. The question, ultimately, is: where should the actual crawl happen? As can be seen above, I have been experimenting with different ways of triggering the actual work, but I am still missing some concept essential for grokking the way things fit together. Note: some of the OTP plumbing is left out for brevity; the plumbing is all there and the system all hangs together.

Search Results

Search found 261 results on 11 pages for 'crawler'.
