hobo spider - Page 2 - Developer IT

Using Scrapy Spider in a CGI Script or Web Framework, ie: Django

- by Israel ANY

How do I use a Python library such as Scrapy in a CGI script or even in a framework such as Django if I need to do so later? Here is the documentation I've consulted thus far, but it doesn't seem to meet the concern I have. http://doc.scrapy.org/topics/spiders.html http://doc.scrapy.org/topics/webconsole.html Critique and suggestions are welcomed!

Read the article

Storing URLs while Spidering

- by itemio

I created a little web spider in python which I'm using to collect URLs. I'm not interested in the content. Right now I'm keeping all the visited URLs in a set in memory, because I don't want my spider to visit URLs twice. Of course that's a very limited way of accomplishing this. So what's the best way to keep track of my visited URLs? Should I use a database? * which one? MySQL, sqlite, postgre? * how should I save the URLs? As a primary key trying to insert every URL before visiting it? Or should I write them to a file? * one file? * multiple files? how should I design the file-structure? I'm sure there are books and a lot of papers on this or similar topics. Can you give me some advice what I should read?

Read the article

wget: Turn Off Forced .html Retreival

- by Mike B

When performing a recursive download, I specify a pattern via the -R parameter for wget to reject, but if this file is a HTML file, wget downloads the file regardless of whether or not it matches the pattern. e.g. wget -r -R "*dynamicfile*" example.com still retrieves files such as example.com/dynamicfile1.html Is there a way to prevent this?

Read the article

How do I rate limit google's crawl of my class C IP block?

- by Zak

I have several sites in a class C network that all get crawled by google on a pretty regular basis. Normally this is fine. However, when google starts crawling all the sites at the same time, the small set of servers that back this IP block can take a pretty big hit on load. With google webmaster tools, you can rate limit the googlebot on a given domain, but I haven't found a way to limit the bot across an IP network yet. Anyone have experience with this? How did you fix it?

Read the article

Is it true that the Google Spider gives the most relevance of a search result to the first 68 characters of the <title>?

- by leeand00

I am reading documentation about my CMS and it states that an HTML page <title> tag is really important in SEO. It states that the Google Spider gives the most relevance to the first 68 characters of a site title. (68 characters being the number of characters that Google will display in it's search engine result pages,) Can anyone verify this is still true? I read in The Information Diet that content farms were getting too good at gaming Google's algorithm for collecting and posting SERPs and so google had to change the search algorithm.

Read the article

What does the arxiv.org anti-bot "search and destroy" actually do?

- by Brian Campbell

The lanl.arxiv.org math and scientific preprint service (formerly known as xxx.lanl.gov) has a strict policy against bots that ignore its robots.txt, Robots Beware. On that page, the have a link labelled with "Click here to initiate automated 'seek-and-destroy' against your site", which is forbidden by their robots.txt but presumably badly behaved robots will follow it, and reap the consequences. The question, what are the actual consequences? I have never had the guts to actually click on that link to see what it does. What can they be doing that is both effective and legal?

Read the article

web spidering/crawling, can i do it or just search engines?

- by bboyreason

i already had a question answered about web-scraping with wget. but as i read a little more, i realize i may be looking for a web-crawling program. particularly the part about web-crawlers being able to get specific data like links or, in my case, products. all of the products on my site have the following naming convention, website.com/uniqueAlphaNumericID.html as far as i know, no dynamic content generation is being used and only one page per one item in the above format. should i just be thinking about: wget website.com | grep *.html or should i be looking into spiders/crawlers?

Read the article

Response code for Chinese spiders? [closed]

- by pt2ph8

My server is being "attacked" by Chinese spiders that don't respect the rules in my robots.txt. They are being very aggressive and using a lot of resources, so I'm going to set up some rules in nginx to block them by user agent. Question: which response code should I return, 403, 444 (empty response in nginx) or something else? I'm wondering how the spiders will react to different status codes. What's the best practice?

Read the article

Legality, terms of service for performing a web crawl

- by Berlin Brown

I was going to crawl a site for some research I was collecting. But, apparently the terms of service is quite clear on the topic. Is it illegal to now "follow" the terms of service. And what can the site normally do? Here is an example clause in the TOS. Also, what about sites that don't provide this particular clause. Restrictions: "use any robot, spider, site search application, or other automated device, process or means to access, retrieve, scrape, or index the site" It is just research? Edit: "OK, from the standpoint of designing an efficient crawler. Should I provide some form of natural language engine to read terms of service and then abide by them."

Read the article

Worried about spiders repeatedly hitting high-demand page

- by Matt Thrower

Due to some rather bizarre architectural considerations I've had to set up something that really ought to run as a console application as a web page. It does the job of writing a large variety of text files and xml feeds from our site data for various other services to pick up so obviously it takes a little while to run and is pretty processor intensive. However, before I deploy it I'm rather worried that it might get hit repeatedly by spiders and the like. It's fine for the data to be re-written but continual hits on this page are going to trigger performance issues for obvious reasons. Is this something I ought to worry about? Or in reality is spider traffic unlikely to be intensive enough to cause problems?

Read the article

Maximum page fetch with maximum bandwith

- by Ehsan

Hi I want to create an application like a spider I've implement fetching page as the following code in multi-thread application but there is two problem 1) I want to use my maximum bandwidth to send/receive request, how should I config my request to do so (Like Download Accelerator application and the like) cause I heard the normal application will use 66% of the available bandwidth. 2) I don't know what exactly HttpWebRequest.KeepAlive do, but as its name implies I think i can create a connection to a website and without closing the connection sending another request to that web site using existing connection. does it boost performance or Im wrong ?? public PageFetchResult Fetch() { PageFetchResult fetchResult = new PageFetchResult(); try { HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(URLAddress); HttpWebResponse resp = (HttpWebResponse)req.GetResponse(); Uri requestedURI = new Uri(URLAddress); Uri responseURI = resp.ResponseUri; string resultHTML = ""; byte[] reqHTML = ResponseAsBytes(resp); if (!string.IsNullOrEmpty(FetchingEncoding)) resultHTML = Encoding.GetEncoding(FetchingEncoding).GetString(reqHTML); else if (!string.IsNullOrEmpty(resp.CharacterSet)) resultHTML = Encoding.GetEncoding(resp.CharacterSet).GetString(reqHTML); req.Abort(); resp.Close(); fetchResult.IsOK = true; fetchResult.ResultHTML = resultHTML; } catch (Exception ex) { fetchResult.IsOK = false; fetchResult.ErrorMessage = ex.Message; } return fetchResult; }

Read the article

Mamimum page fetch with maximum bandwith

- by Ehsan

Hi I want to create an application like a spider I've implement fetching page as the following code in multi-thread application but there is two problem 1) I want to use my maximum bandwidth to send/receive request, how should I config my request to do so (Like Download Accelerator application and the like) cause I heard the normal application will use 66% of the available bandwidth. 2) I don't know what exactly HttpWebRequest.KeepAlive do, but as its name implies I think i can create a connection to a website and without closing the connection sending another request to that web site using existing connection. does it boost performance or Im wrong public PageFetchResult Fetch() { PageFetchResult fetchResult = new PageFetchResult(); try { HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(URLAddress); HttpWebResponse resp = (HttpWebResponse)req.GetResponse(); Uri requestedURI = new Uri(URLAddress); Uri responseURI = resp.ResponseUri; if (Uri.Equals(requestedURI, responseURI)) { string resultHTML = ""; byte[] reqHTML = ResponseAsBytes(resp); if (!string.IsNullOrEmpty(FetchingEncoding)) resultHTML = Encoding.GetEncoding(FetchingEncoding).GetString(reqHTML); else if (!string.IsNullOrEmpty(resp.CharacterSet)) resultHTML = Encoding.GetEncoding(resp.CharacterSet).GetString(reqHTML); req.Abort(); resp.Close(); fetchResult.IsOK = true; fetchResult.ResultHTML = resultHTML; } else { URLAddress = responseURI.AbsoluteUri; relayPageCount++; if (relayPageCount > 5) { fetchResult.IsOK = false; fetchResult.ErrorMessage = "Maximum page redirection occured."; relayPageCount = 0; return fetchResult; } req.Abort(); resp.Close(); return Fetch(); } } catch (Exception ex) { fetchResult.IsOK = false; fetchResult.ErrorMessage = ex.Message; } return fetchResult; }

Read the article

Detecting 'stealth' web-crawlers

- by Jacco

What options are there to detect web-crawlers that do not want to be detected? (I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think that we will ever be able to block smart stealth-crawlers anyway, only the ones that make mistakes.) I'm not talking about the nice crawlers such as googlebot and Yahoo! Slurp. I consider a bot nice if it: identifies itself as a bot in the user agent string reads robots.txt (and obeys it) I'm talking about the bad crawlers, hiding behind common user agents, using my bandwidth and never giving me anything in return. There are some trapdoors that can be constructed updated list (thanks Chris, gs): Adding a directory only listed (marked as disallow) in the robots.txt, Adding invisible links (possibly marked as rel="nofollow"?), style="display: none;" on link or parent container placed underneath another element with higher z-index detect who doesn't understand CaPiTaLiSaTioN, detect who tries to post replies but always fail the Captcha. detect GET requests to POST-only resources detect interval between requests detect order of pages requested detect who (consistently) requests https resources over http detect who does not request image file (this in combination with a list of user-agents of known image capable browsers works surprisingly nice) Some traps would be triggered by both 'good' and 'bad' bots. you could combine those with a whitelist: It trigger a trap It request robots.txt? It doest not trigger another trap because it obeyed robots.txt One other important thing here is: Please consider blind people using a screen readers: give people a way to contact you, or solve a (non-image) Captcha to continue browsing. What methods are there to automatically detect the web crawlers trying to mask themselves as normal human visitors. Update The question is not: How do I catch every crawler. The question is: How can I maximize the chance of detecting a crawler. Some spiders are really good, and actually parse and understand html, xhtml, css javascript, VB script etc... I have no illusions: I won't be able to beat them. You would however be surprised how stupid some crawlers are. With the best example of stupidity (in my opinion) being: cast all URLs to lower case before requesting them. And then there is a whole bunch of crawlers that are just 'not good enough' to avoid the various trapdoors.

Read the article

Turning on collision crashes game

- by MomentumGaming

I am getting a null pointer excecption to both my sprite and level. I am working on my mob class, and when I try to move him and the move function is called, the game crashes after checking collision with a null pointer excecption. Taking out the one line that actually checks if the tile located in front of it fixes the problem. Also, if i keep collision ON but don't move the position of the mob (the spider) the game works fine. I will have collision, and the spider appears on the screen, only problem is, getting it to move causes this nasty error that i just can't fix. true Exception in thread "Display" java.lang.NullPointerException at com.apcompsci.game.entity.mob.Mob.collision(Mob.java:67) at com.apcompsci.game.entity.mob.Mob.move(Mob.java:38) at com.apcompsci.game.entity.mob.spider.update(spider.java:58) at com.apcompsci.game.level.Level.update(Level.java:55) at com.apcompsci.game.Game.update(Game.java:128) at com.apcompsci.game.Game.run(Game.java:106) at java.lang.Thread.run(Unknown Source) Here is my renderMob mehtod: public void renderMob(int xp,int yp,Sprite sprite,int flip) { xp -= xOffset; yp-=yOffset; for(int y = 0; y<32; y++) { int ya = y + yp; int ys = y; if(flip == 2||flip == 3)ys = 31-y; for(int x = 0; x<32; x++) { int xa = x + xp; int xs = x; if(flip == 1||flip == 3)xs = 31-x; if(xa < -32 || xa >=width || ya<0||ya>=height) break; if(xa<0) xa =0; int col = sprite.pixels[xs+ys*32]; if(col!= 0x000000) pixels[xa+ya*width] = col; } } } My spider class which determines the sprite and where I control movement, also rendering the spider onto the screen, when I increment ya to move the sprite, I get the crash, but without ya++, it runs flawlessly with a spider sprite on screen: package com.apcompsci.game.entity.mob; import com.apcompsci.game.entity.mob.Mob.Direction; import com.apcompsci.game.graphics.Screen; import com.apcompsci.game.graphics.Sprite; import com.apcompsci.game.level.Level; public class spider extends Mob{ Direction dir; private Sprite sprite; private boolean walking; public spider(int x, int y) { this.x = x <<4; this.y = y <<4; sprite = sprite.spider_forward; } public void update() { int xa = 0, ya = 0; ya++; if(ya<0) { sprite = sprite.spider_forward; dir = Direction.UP; } if(ya>0) { sprite = sprite.spider_back; dir = Direction.DOWN; } if(xa<0) { sprite = sprite.spider_side; dir = Direction.LEFT; } if(xa>0) { sprite = sprite.spider_side; dir = Direction.LEFT; } if(xa!= 0 || ya!= 0) { System.out.println("true"); move(xa,ya); walking = true; } else{ walking = false; } } public void render(Screen screen) { screen.renderMob(x, y, sprite, 0); } } This is th mob class that contains the move() method that is called in the spider class above. This move method calls the collision method. tile and sprite comes up null in the debugger: package com.apcompsci.game.entity.mob; import java.util.ArrayList; import java.util.List; import com.apcompsci.game.entity.Entity; import com.apcompsci.game.entity.projectile.DemiGodProjectile; import com.apcompsci.game.entity.projectile.Projectile; import com.apcompsci.game.graphics.Sprite; public class Mob extends Entity{ protected Sprite sprite; protected boolean moving = false; protected enum Direction { UP,DOWN,LEFT,RIGHT } protected Direction dir; public void move(int xa,int ya) { if(xa != 0 && ya != 0) { move(xa,0); move(0,ya); return; } if(xa>0) dir = Direction.RIGHT; if(xa<0) dir = Direction.LEFT; if(ya>0)dir = Direction.DOWN; if(ya<0)dir = Direction.UP; if(!collision(xa,ya)){ x+= xa; y+=ya; } } public void update() { } public void shoot(int x, int y, double dir) { //dir = Math.toDegrees(dir); Projectile p = new DemiGodProjectile(x, y,dir); level.addProjectile(p); } public boolean collision(int xa,int ya) { boolean solid = false; for(int c = 0; c<4; c++) { int xt = ((x+xa) + c % 2 * 14 - 8 )/16; int yt = ((y+ya) + c / 2 * 12 +3 )/16; if(level.getTile(xt, yt).solid()) solid = true; } return solid; } public void render() { } } Finally, here is the method in which i call the add() method for the spider to add it to the level: protected void loadLevel(String path) { try{ BufferedImage image = ImageIO.read(SpawnLevel.class.getResource(path)); int w = width =image.getWidth(); int h = height = image.getHeight(); tiles = new int[w*h]; image.getRGB(0, 0, w,h, tiles,0, w); } catch(IOException e){ e.printStackTrace(); System.out.println("Exception! Could not load level file!"); } add(new spider(20,45)); } I don't think i need to include the level class but just in case, I have provided a gistHub link for better context. It contains all of the full classes listed above , plus my entity class and maybe another. Thanks for the help if you decide to do so, much appreciated! Also, please tell me if i'm in the wrong section of stackeoverflow, i figured that since this is the gamign section that it belonged but debugging code normally goes into the general section.

Read the article

How to track humans, but not spiders?

- by johnnietheblack

I am writing a nice little PHP/MySQL/JS ad system for my site - charging by impression. Therefore, my clients would like to be sure that they aren't paying for impressions caused by robots / spiders / etc. Is there a decently effective way to decipher between human and robot, without doing something obstructive like a captcha, or requesting "no index" or something? Basically, is there some sort of "name tag" that legit spiders have on so that my site knows their presence, therefore letting me react accordingly? And, if there isn't...how do ad companies like Google defend against this?

Read the article

JDOM 1.1: hyphen is not a valid comment character

- by Stefan Kendall

I'm using tagsoup to clean some HTML I'm scraping from the internet, and I'm getting the following error when parsing through pages with comments: The data "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " is not legal for a JDOM comment: Comment data cannot start with a hyphen. I'm using JDOM 1.1, and here's the code that does the actual cleaning: SAXBuilder builder = new org.jdom.input.SAXBuilder("org.ccil.cowan.tagsoup.Parser"); // build // Don't check the doctype! At our usage rate, we'll get 503 responses // from the w3. builder.setEntityResolver(dummyEntityResolver); Reader in = new StringReader(str); org.jdom.Document doc = builder.build(in); String cleanXmlDoc = new org.jdom.output.XMLOutputter().outputString(doc); Any idea what's going wrong, or how to fix this? I need to be able to parse pages with long comment strings of

Read the article

Which web crawler to use to save news articles from a website into .txt files?

- by brokencoding

Hi, i am currently in dire need of news articles to test a LSI implementation (it's in a foreign language, so there isnt the usual packs of files ready to use). So i need a crawler that given a starting url, let's say http://news.bbc.co.uk/ follows all the contained links and saves their content into .txt files, if we could specify the format to be UTF8 i would be in heaven. I have 0 expertise in this area, so i beg you for some sugestions in which crawler to use for this task.

Read the article

SEO for Ultraseek 5.7

- by Adam N

We've got Ultraseek 5.7 indexing the content on our corporate intranet site, and we'd like to make sure our web pages are being optimized for it. Which SEO techniques are useful for Ultraseek, and where can I find documentation about these features? Features I've considered implementing: Make the title and first H1 contain the most valuable information about the page Implement a sitemap.xml file Ping the Ultraseek xpa interface when new content is added Use "SEO-Friendly" URL strings Add Meta keywords to the HTML pages.

Read the article

Indexing community images for google image search

- by Vittorio Vittori

Hi, I'm trying to understand how can I do to let my site be reachable from google image search spiders. I like how last.fm solution, and I thought to use a technique like his staff do to let google find artists images on their pages. When I'm looking for an artist and I search it on google image search, as often as not I find an image from last.fm artists page, I make an example: If I search the band Pure Reason Revolution It brings me here, the artist's image page http://www.last.fm/music/Pure+Reason+Revolution/+images/4284073 Now if I take a look to the image file, i can see it's named: http://userserve-ak.last.fm/serve/500/4284073/Pure+Reason+Revolution+4.jpg so if I try to undertand how the service works I can try to say: http://userserve-ak.last.fm/serve/ the server who serve the images 500/ the selected size for the image 4284073/ the image id for database Pure+Reason+Revolution+4.jpg the image name I thought it's difficult to think the real filename for the image is Pure+Reason+Revolution+4.jpg for image overwrite problems when an user upload it, in fact if I digit: http://userserve-ak.last.fm/serve/500/4284073.jpg I probably find the real image location and filename With this tecnique the image is highly reachable from search engines and easily archived. My question is, does exist some guide or tutorial to approach on this kind of tecniques, or something similar?

Read the article

Backlink-reporting website crawler?

- by Stewart

What tools are there out there to crawl a website and report, for each page, a list of pages within the website that link to it?

Read the article

Prepare community images for google image search indexing

- by Vittorio Vittori

Hi, I'm trying to understand how can I do to let my site be reachable from google image search spiders. I like how last.fm solution, and I thought to use a technique like his staff do to let google find artists images on their pages. When I'm looking for an artist and I search it on google image search, as often as not I find an image from last.fm artists page, I make an example: If I search the band Pure Reason Revolution It brings me here, the artist's image page http://www.last.fm/music/Pure+Reason+Revolution/+images/4284073 Now if I take a look to the image file, i can see it's named: http://userserve-ak.last.fm/serve/500/4284073/Pure+Reason+Revolution+4.jpg so if I try to undertand how the service works I can try to say: http://userserve-ak.last.fm/serve/ the server who serve the images 500/ the selected size for the image 4284073/ the image id for database Pure+Reason+Revolution+4.jpg the image name I thought it's difficult to think the real filename for the image is Pure+Reason+Revolution+4.jpg for image overwrite problems when an user upload it, in fact if I digit: http://userserve-ak.last.fm/serve/500/4284073.jpg I probably find the real image location and filename With this tecnique the image is highly reachable from search engines and easily archived. My question is, does exist some guide or tutorial to approach on this kind of tecniques, or something similar?

Read the article

Prepare your site images for google image search indexing

- by Vittorio Vittori

Hi, I'm trying to understand how can I do to let my site be reachable from google image search spiders. I like how last.fm solution, and I thought to use a technique like his staff do to let google find artists images on their pages. When I'm looking for an artist and I search it on google image search, as often as not I find an image from last.fm artists page, I make an example: If I search the band Pure Reason Revolution It brings me here, the artist's image page http://www.last.fm/music/Pure+Reason+Revolution/+images/4284073 Now if I take a look to the image file, i can see it's named: http://userserve-ak.last.fm/serve/500/4284073/Pure+Reason+Revolution+4.jpg so if I try to undertand how the service works I can try to say: http://userserve-ak.last.fm/serve/ the server who serve the images 500/ the selected size for the image 4284073/ the image id for database Pure+Reason+Revolution+4.jpg the image name I thought it's difficult to think the real filename for the image is Pure+Reason+Revolution+4.jpg for image overwrite problems when an user upload it, in fact if I digit: http://userserve-ak.last.fm/serve/500/4284073.jpg I probably find the real image location and filename With this tecnique the image is highly reachable from search engines and easily archived. My question is, does exist some guide or tutorial to approach on this kind of tecniques, or something similar?

Read the article

How to extract the headline and content from a crawled web page / article?

- by gAMBOOKa

I need some guidelines on how to detect the headline and content of crawled pages. I've been seeing some very weird front-end codework since i started working on this crawler.

Read the article

Is flash's geturl(...) spiderable by google?

- by tom

If I made a homepage with an embedded .swf which had buttons that linked to other html pages on my website using the getUrl() function, would those links be spiderable by google? Or should I also put in text links outside of the .swf (which would ruin the design a bit)? I know a lot of people will argue I shouldn't have flash as the main content on the homepage (and their comments are appreciated), but bear in mind, that's not my question.

Read the article

Is there a way to catch assignments to or reads from an undefined property in the Spider-Monkey Java

- by avri

The Spider-Monkey JavaScript engine implements the noSuchMethod callback function for JavaScript Objects. This function is called whenever JavaScript tries to execute an undefined method of an Object. I would like to set a callback function to an Object that will be called whenever an undefined property in the Object is accessed or assigned to. I haven't found a noSuchProperty function implemented for JavaScript Objects and I am curios if there is any workaround that will achieve the same result.

Search Results

Search found 178 results on 8 pages for 'hobo spider'.

Page 2/8 | < Previous Page | 1 2 3 4 5 6 7 8 | Next Page >

- by Israel ANY

- by itemio

- by Mike B

- by Zak

- by leeand00

- by Brian Campbell

- by bboyreason

- by pt2ph8

- by Berlin Brown

- by Matt Thrower

- by Ehsan

- by Ehsan

- by Jacco

- by MomentumGaming

- by johnnietheblack

- by Stefan Kendall

- by brokencoding

- by Adam N

- by Vittorio Vittori

- by Stewart

- by Vittorio Vittori

- by Vittorio Vittori

- by gAMBOOKa

- by tom

- by avri

< Previous Page | 1 2 3 4 5 6 7 8 | Next Page >