Search Results

Search found 325 results on 13 pages for 'tasty spider men'.


  • Creating a spider using Scrapy, Spider generation error.

    - by Nacari
    I just downloaded Scrapy (the web crawler) on 32-bit Windows and created a new project folder using the "scrapy-ctl.py startproject dmoz" command at the DOS prompt. I then tried to create the first spider using the command: scrapy-ctl.py genspider myspider myspdier-domain.com but it did not work and returns the error: Error running: scrapy-ctl.py genspider, Cannot find project settings module in python path: scrapy_settings. I know I have the path set correctly (to python26/scripts), but I am having difficulty figuring out what the problem is. I am new to both Scrapy and Python, so there is a good possibility that I have failed to do something important. Also, I have been using Eclipse with the PyDev plugin to edit the code, if that might cause problems.
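
    A quick way to narrow this down is to check whether the settings module named in the error is importable at all from wherever the command is launched. The snippet below is only a diagnostic sketch; the module name scrapy_settings is taken straight from the error message above, not from the Scrapy docs.

        import sys
        print sys.path                   # the Scrapy project directory should show up here
        try:
            import scrapy_settings       # the module the error says it cannot find
            print "settings module found:", scrapy_settings.__file__
        except ImportError, err:
            print "not importable from here:", err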

    Read the article

  • Site crawler/spider that tosses results into mysql

    - by ian.evans
    It's been suggested that we use MySQL for our site's search, as it'd be running on the same server that hosts our web server (nginx) and our db (MySQL). Since not all of our pages are created from the database, it's been suggested that we have a crawler that can crawl the site, toss each page's URL and data into MySQL, and have Sphinx index that. Does anyone know of an open-source spider that has a MySQL storage option out of the box? Thanks.
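
    For reference, here is a rough, hedged sketch of the crawl-and-store half of that setup in Python 2 (host, credentials, table schema, and start URL are all placeholders); Sphinx would then be pointed at the resulting table. A ready-made open-source spider may still be the better answer.

        # Assumes: the MySQLdb package is installed, and a table like
        #   CREATE TABLE pages (url VARCHAR(255) PRIMARY KEY, body MEDIUMTEXT);
        import re
        import urllib2
        import urlparse
        import MySQLdb

        START = 'http://www.example.com/'                  # placeholder start URL
        HREF = re.compile(r'href=[\'"]?([^\'" >]+)')

        db = MySQLdb.connect(host='localhost', user='search', passwd='secret', db='site')
        cur = db.cursor()

        seen, queue = set(), [START]
        while queue:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urllib2.urlopen(url).read()
            except Exception:
                continue
            # store the raw page; Sphinx indexes the pages table separately
            cur.execute("REPLACE INTO pages (url, body) VALUES (%s, %s)", (url, html))
            db.commit()
            # follow same-host links only
            for link in HREF.findall(html):
                absolute = urlparse.urljoin(url, link)
                if urlparse.urlparse(absolute).netloc == urlparse.urlparse(START).netloc:
                    queue.append(absolute)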

    Read the article

  • Baidu spider is hammering my server and bloating my error_log file

    - by Gravy
    I am getting the following errors in my /etc/httpd/logs/error_log file:

        [Sun Oct 20 00:04:15 2013] [error] [client 180.76.5.16] File does not exist: /usr/local/apache/htdocs/homes
        [Sun Oct 20 00:08:31 2013] [error] [client 180.76.5.113] File does not exist: /usr/local/apache/htdocs/homes
        [Sun Oct 20 00:12:47 2013] [error] [client 180.76.5.88] File does not exist: /usr/local/apache/htdocs/homes
        [Sun Oct 20 00:17:07 2013] [error] [client 180.76.5.138] File does not exist: /usr/local/apache/htdocs/homes

    These errors occur so often that my error log files are over 500MB! I have done an IP trace on the client addresses and found that they belong to something called Baidu: Beijing Baidu Netcom Science and Technology Co. in China. Is there a way I can just get Apache to deny any incoming requests from some crummy spider that is repeatedly hitting my site? Is there a better way of dealing with the problem? I am happy to completely block out China if it means that I can actually track real errors.

    Read the article

  • Website crawler/spider to get site map

    - by ack__
    I need to retrieve a whole website map, in a format like:

        http://example.org/
        http://example.org/product/
        http://example.org/service/
        http://example.org/about/
        http://example.org/product/viewproduct/

    I need it to be link-based (no file or directory brute-force), like: parse the homepage, retrieve all links, explore them, retrieve their links, and so on. I also need the ability to detect whether a page is a "template", so as not to retrieve all of the "child pages". For example, if the following links are found:

        http://example.org/product/viewproduct?id=1
        http://example.org/product/viewproduct?id=2
        http://example.org/product/viewproduct?id=3

    I need to get http://example.org/product/viewproduct only once. I've looked into HTTrack and wget (with its spider option), but nothing conclusive so far. The software/tool should be downloadable, and I'd prefer it to run on Linux. It can be written in any language. Thanks
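
    The "template" part of the requirement is mostly a URL-normalization problem. Below is a small, hedged sketch of that idea in Python 2 (standard library only; the crawl loop itself is left out): URLs that differ only in their query string are collapsed to a single entry.

        import urlparse

        def normalize(url):
            parts = urlparse.urlsplit(url)
            # drop the query string and fragment so parameterized pages collapse together
            return urlparse.urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

        seen = set()
        for url in ['http://example.org/product/viewproduct?id=1',
                    'http://example.org/product/viewproduct?id=2',
                    'http://example.org/product/viewproduct?id=3',
                    'http://example.org/about/']:
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                print key            # only the first ?id=... variant is recorded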

    Read the article

  • Spider a Website and Return URLs Only

    - by Rob Wilkerson
    I'm not quite sure how best to define/articulate this, but I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this idea with Wget using the --spider option, but when piping that output through a grep, I can't seem to find the right magic to make it work:

        wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:'

    The grep filter seems to have absolutely no effect on the wget output. Have I got something wrong, or is there another tool I should try that's more geared towards providing this kind of limited result set? Thanks.

    UPDATE: So I just found out offline that, by default, wget writes to stderr. I missed that in the man pages (in fact, I still haven't found it, if it's in there). Once I redirected stderr to stdout, I got closer to what I need:

        wget --spider --force-html -r -l1 http://somesite.com 2>&1 | grep 'Saving to:'

    I'd still be interested in other/better means for doing this kind of thing, if any exist.
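
    Since the question explicitly asks about other means, here is a small, hedged sketch of the same idea in Python 2 using only the standard library: fetch one page and print the absolute URLs it links to. Recursing over those URLs (and restricting them to one host) is deliberately left out; the base URL is the placeholder from the question.

        import urllib2
        import urlparse
        from HTMLParser import HTMLParser

        class LinkLister(HTMLParser):
            """Collects the href of every <a> tag it sees."""
            def __init__(self):
                HTMLParser.__init__(self)
                self.links = []
            def handle_starttag(self, tag, attrs):
                if tag == 'a':
                    for name, value in attrs:
                        if name == 'href' and value:
                            self.links.append(value)

        base = 'http://somesite.com/'
        parser = LinkLister()
        parser.feed(urllib2.urlopen(base).read())
        for link in parser.links:
            print urlparse.urljoin(base, link)   # URIs only, no content saved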

    Read the article

  • Getting Started with Python: Attribute Error

    - by Nacari
    I am new to Python and just downloaded it today. I am using it to work on a web spider, so to test it out and make sure everything was working, I downloaded a sample piece of code. Unfortunately, it does not work and gives me the error: "AttributeError: 'MyShell' object has no attribute 'loaded'" I am not sure if the code itself has an error or if I failed to do something correctly when installing Python. Is there anything you have to do when installing Python, like adding environment variables, etc.? And what does that error generally mean? Here is the sample code I used with the imported spider class:

        import chilkat

        spider = chilkat.CkSpider()
        spider.Initialize("www.chilkatsoft.com")
        spider.AddUnspidered("http://www.chilkatsoft.com/")
        for i in range(0, 10):
            success = spider.CrawlNext()
            if (success == True):
                print spider.lastUrl()
            else:
                if (spider.get_NumUnspidered() == 0):
                    print "No more URLs to spider"
                else:
                    print spider.lastErrorText()
            # Sleep 1 second before spidering the next URL.
            spider.SleepMs(1000)

    Read the article

  • How do bing-bot (is that the right spider name?) and googlebot interpret a 301 redirect?

    - by jbcurtin
    I've been looking for documentation on how the Microsoft and Google bots interpret 301 redirects. It seems that googlebot stores documents in a URL-based index system, but I haven't been able to figure out how Bing works. Should I assume that they are still working towards copying everyone else, and that they use an algorithm close to Google's? Is it best to just forward a page to a new location via JavaScript? I think this might be a blackhat trick, but how would I tell the bots that it's not? Is a 301 redirect my best option, and do I just have to bite the bullet because said pages no longer exist? What other options do I have that I might not be aware of?
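
    For comparison with the JavaScript idea, this is roughly what serving a real 301 looks like. The sketch below is a bare-bones Python CGI script (the target URL is a placeholder); it hands the crawler an explicit, machine-readable "moved permanently" signal instead of a client-side hop.

        #!/usr/bin/env python
        # Hypothetical CGI script for a page that has moved.
        print "Status: 301 Moved Permanently"
        print "Location: http://example.com/new-location/"
        print                                   # blank line ends the CGI headers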

    Read the article

  • How to create a web crawler/spider/robot?

    - by Chris
    Is there a way to make a web robot like websiteoutlook.com does? I need something that searches the internet for URLs only... I don't need links, descriptions, etc. What is the best way to do this without getting too technical? I guess it could even be a cronjob that runs a PHP script grabbing URLs from Google, or is there a better way? A simple example or a link to more information would be much appreciated.

    Read the article

  • i demand you all men to help

    - by Hello you all men
    You all men, I demand an answer. You say this is not a real question?? Many pages get a big load for a long time and now we are suspended: http://stackoverflow.com/questions/2890840/how-can-we-make-our-website-scalable Please help, men. Very sorry. -bern

    Read the article

  • Scrapy Could not find spider Error

    - by Nacari
    I have been trying to get a simple spider to run with Scrapy, but keep getting the error: Could not find spider for domain: stackexchange.com when I run the code with the command scrapy-ctl.py crawl stackexchange.com. The spider is as follows:

        from __future__ import absolute_import
        from scrapy.spider import BaseSpider

        class StackExchangeSpider(BaseSpider):
            domain_name = "stackexchange.com"
            start_urls = [
                "http://www.stackexchange.com/",
            ]

            def parse(self, response):
                filename = response.url.split("/")[-2]
                open(filename, 'wb').write(response.body)

        SPIDER = StackExchangeSpider()

    Another person posted almost exactly the same problem months ago but did not say how they fixed it: http://stackoverflow.com/questions/1806990/scrapy-spider-is-not-working I have been following the tutorial at http://doc.scrapy.org/intro/tutorial.html exactly, and cannot figure out why it is not working.

    Read the article

  • Skynet Big Data Demo Using Hexbug Spider Robot, Raspberry Pi, and Java SE Embedded (Part 4)

    - by hinkmond
    Here's the first sign of life of a Hexbug Spider Robot converted to become a Skynet Big Data model T-1. Yes, this is the T-1, the precursor to the Cyberdyne Systems T-101 (and you know where that will lead to...). It is demonstrating a heartbeat using a simple Java SE Embedded program to drive it. See: Skynet Model T-1 Heartbeat It's alive!!! Well, almost alive. At least there's a pulse. We'll program more of its actions next, and then finally connect it to Skynet Big Data to do more advanced stuff, like hunt for Sarah Connor. Java SE Embedded programming makes it simple to create the first model in the long line of T-XXX robots to take on the world, and a Raspberry Pi makes it easy to connect everything together on one simple device. In the next post, I'll show how the wires are connected to drive the T-1 robot. Hinkmond

    Read the article

  • Skynet Big Data Demo Using Hexbug Spider Robot, Raspberry Pi, and Java SE Embedded (Part 3)

    - by hinkmond
    In Part 2, I described the connections you need to make for this demo using a Hexbug Spider Robot, a Raspberry Pi, and Java SE Embedded for programming. Here are some photos of me doing the soldering. Software engineers should not be afraid of a little soldering work. It's all good. See: Skynet Big Data Demo (Part 2) One thing to watch out for when you open the remote is that there may be some glue covering the contact points. Make sure to use an X-Acto knife or small screwdriver to scrape away any glue or non-conductive material covering each place where you need to solder. And after you are done with your soldering and have given the solder enough time to cool, make sure all your connections are marked so that you know which wire goes where. Give each wire a very light tug to make sure it is soldered correctly and is making good contact. There are lots of videos on the Web to help you if this is your first time soldering. Check out Lady Ada's (from adafruit.com) links on how to solder if you need some additional help: http://www.ladyada.net/learn/soldering/thm.html If everything looks good, zip everything back up and meet back here for how to connect these wires to your Raspberry Pi. That will be it for the hardware part of this project. See, that wasn't so bad. Hinkmond

    Read the article

  • Scrapy domain_name for spider

    - by Zeynel
    From the Scrapy tutorial: domain_name: identifies the Spider. It must be unique, that is, you can't set the same domain name for different Spiders. Does this mean that domain_name must be a valid domain name, like

        domain_name = 'example.com'

    or can I name it

        domain_name = 'ex1'

    The problem is that I had a spider that worked with the domain name domain_name = 'whitecase.com'. Now I have created a new spider as an instance of CrawlSpider and named it domain_name = 'wc2', but I am getting the error "could not find spider for domain "wc2"".
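
    For reference, this is the shape such a spider takes in the old Scrapy API quoted elsewhere on this page; the string in domain_name is the identifier passed on the command line, so the sketch below would be run as scrapy-ctl.py crawl wc2. It is a hedged illustration, not a guaranteed fix for the error above: the spider module still has to live where the project can find it, and whether CrawlSpider adds further requirements is a separate question.

        from scrapy.spider import BaseSpider

        class Wc2Spider(BaseSpider):
            # identifier used on the command line: scrapy-ctl.py crawl wc2
            domain_name = 'wc2'
            start_urls = ['http://www.whitecase.com/aabbas/']

            def parse(self, response):
                self.log('Visited %s' % response.url)

        SPIDER = Wc2Spider()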

    Read the article

  • Week in Geek: Study finds Men more Likely to Fall for Facebook Scams

    - by Asian Angel
    This week we learned how to “read Blue Screen codes, clean your computer, & get started with scripting”, upgrade or install Mac OS X Lion on a Hackintosh using UniBeast, use Amazon’s barcode scanner to easily buy anything from your phone, had fun with a great set of geeky do-it-yourself projects for pets, got introduced to How-To Geek’s new Google+ account, and more. Photo by mac_filko.

        Use Amazon’s Barcode Scanner to Easily Buy Anything from Your Phone
        How To Migrate Windows 7 to a Solid State Drive
        Follow How-To Geek on Google+

    Read the article

  • Windows Server 2003 network boogey men every DBA should know

    - by merrillaldrich
    Recently I was again visited by my old friends TCP Chimney and SynAttackProtect. (Yeah, sometimes I feel like I mostly blog about five-year-old problems, but many of us as DBAs have to work on older versions or older systems, and so repeat older problems :-). This has been written about before, but as I BinGoogled around I noticed you are more likely to find the documents if you search for the cause, and not the symptoms. Most people who face a problem, of course, know the symptoms but not the cause....(read more)

    Read the article

  • Is there any way of making JSON data readable by a Google spider?

    - by leeand00
    Is it possible to make JSON data readable by a Google spider? Say, for instance, that I have a JSON feed that contains the data for an e-commerce site. This JSON data is used to populate a human-readable page in the user's browser. (I.e., the translation from JSON data to a human-displayed page is done inside the user's browser; not my choice, just what I've been given to work with. It's an old legacy CGI application, not an actual server-side scripting language.) My concern is that the Google spiders will not be able to pick up and directly link to the item in question when a user clicks on it in Google; the user would be presented with an index page full of all the items, rather than being linked directly to the item they clicked on. Is there any way of "informing" the Google spider, in the JSON, that it should send the user to a different link?
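
    One pragmatic, hedged workaround (not something the JSON itself can tell the spider) is to pre-render a plain, crawlable HTML index from the same feed, so each item gets a real link the spider can follow and deep-link to. The sketch below is Python 2 and assumes a feed file and field names that are purely illustrative.

        import json

        with open('products.json') as feed:          # hypothetical feed file
            items = json.load(feed)

        with open('index_static.html', 'w') as out:
            out.write('<html><body>\n')
            for item in items:
                # assumes each item carries a canonical detail-page URL and a name
                out.write('<a href="%s">%s</a><br>\n' % (item['url'], item['name']))
            out.write('</body></html>\n')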

    Read the article

  • What is the best way to archive (spider) a site that is going to be removed?

    - by Guy
    Three different blogs that I read have recently announced that they are going to be discontinued and removed from the web. Although the archived pages will probably be in Google's cache for a few weeks after they've gone, and some of the pages will be in the Wayback Machine, I'd like to archive those sites to my hard disk for future reference. What is the best way to do this? Is there any software that transforms a blog (e.g. Blogspot) into a chronological PDF?

    Read the article

  • Code Golf: Spider webs

    - by LiraNuna
    The challenge The shortest code by character count to output a spider web with rings equal to user's input. A spider web is started by reconstructing the center ring: \_|_/ _/ \_ \___/ / | \ Then adding rings equal to the amount entered by the user. A ring is another level of a "spider circles" made from \ / | and _, and wraps the center circle. Input is always guaranteed to be a single positive integer. Test cases Input 1 Output \__|__/ /\_|_/\ _/_/ \_\_ \ \___/ / \/_|_\/ / | \ Input 4 Output \_____|_____/ /\____|____/\ / /\___|___/\ \ / / /\__|__/\ \ \ / / / /\_|_/\ \ \ \ _/_/_/_/_/ \_\_\_\_\_ \ \ \ \ \___/ / / / / \ \ \ \/_|_\/ / / / \ \ \/__|__\/ / / \ \/___|___\/ / \/____|____\/ / | \ Input: 7 Output: \________|________/ /\_______|_______/\ / /\______|______/\ \ / / /\_____|_____/\ \ \ / / / /\____|____/\ \ \ \ / / / / /\___|___/\ \ \ \ \ / / / / / /\__|__/\ \ \ \ \ \ / / / / / / /\_|_/\ \ \ \ \ \ \ _/_/_/_/_/_/_/_/ \_\_\_\_\_\_\_\_ \ \ \ \ \ \ \ \___/ / / / / / / / \ \ \ \ \ \ \/_|_\/ / / / / / / \ \ \ \ \ \/__|__\/ / / / / / \ \ \ \ \/___|___\/ / / / / \ \ \ \/____|____\/ / / / \ \ \/_____|_____\/ / / \ \/______|______\/ / \/_______|_______\/ / | \ Code count includes input/output (i.e full program).

    Read the article

  • Scrapy spider is not working

    - by Zeynel
    Since nothing so far is working, I started a new project with

        python scrapy-ctl.py startproject Nu

    I followed the tutorial exactly, created the folders, and wrote a new spider:

        from scrapy.contrib.spiders import CrawlSpider, Rule
        from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
        from scrapy.selector import HtmlXPathSelector
        from scrapy.item import Item
        from Nu.items import NuItem
        from urls import u

        class NuSpider(CrawlSpider):
            domain_name = "wcase"
            start_urls = ['http://www.whitecase.com/aabbas/']

            names = hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+')
            u = names.pop()

            rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),)

            def parse(self, response):
                self.log('Hi, this is an item page! %s' % response.url)
                hxs = HtmlXPathSelector(response)
                item = Item()
                item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
                return item

        SPIDER = NuSpider()

    and when I run

        C:\Python26\Scripts\Nu>python scrapy-ctl.py crawl wcase

    I get

        [Nu] ERROR: Could not find spider for domain: wcase

    The other spiders are at least recognized by Scrapy; this one is not. What am I doing wrong? Thanks for your help!

    Read the article

  • Are women worse developers than men? [closed]

    - by Ekaterina
    Hi people, I am a software engineer and a woman. I constantly keep hearing all these jokes around me about women in programming. They (meaning male colleagues) keep pointing out the differences in thinking between men and women. The truth is that when I started working as a developer, my colleagues gave me a hard time only because I am a woman. They automatically assumed that I wanted to do only HTML and styling, and didn't even give me the chance to do something different. I am a .NET programmer and I really disliked (and still dislike) front-end development. I do agree that men and women think differently, but I don't agree that that is necessarily a bad thing. Approaching problems and goals differently brings more ideas and diversity. I really believe that there are good developers and bad developers regardless of the male/female factor. I am curious to hear the overall opinion, though. Would you not hire a woman developer only because she is a woman? Cheers!

    Read the article

  • How do I block a user-agent from Apache

    - by rubo77
    How do I block requests by user-agent string, using a regular expression, in the config files of my Apache web server? For example, I would like to block from Apache, on my Debian server, all bots whose user-agent matches the regular expression /\b\w+[Bb]ot\b/ or /Spider/. Those bots should not be able to see any page on my server, and they should not appear in either the access logs or the error logs. http://global-security.blogspot.de/2009/06/how-to-block-robots-before-they-hit.html suggests using mod_security for that, but isn't there a simple directive for httpd.conf?

    Read the article
