desert spider - Developer IT

Creating a spider using Scrapy, Spider generation error.

- by Nacari

I just downloaded Scrapy (web crawler) on Windows 32 and have just created a new project folder using the "scrapy-ctl.py startproject dmoz" command in dos. I then proceeded to created the first spider using the command: scrapy-ctl.py genspider myspider myspdier-domain.com but it did not work and returns the error: Error running: scrapy-ctl.py genspider, Cannot find project settings module in python path: scrapy_settings. I know I have the path set right (to python26/scripts), but I am having difficulty figuring out what the problem is. I am new to both scrapy and python so there is a good possibility that I have failled to do something important. Also, I have been using eclipse with the Pydev plugin to edit the code if that might cause some problems.

Read the article

Site crawler/spider that tosses results into mysql

- by ian.evans

It's been suggested that we use mysql for our site's search as it'd be running on the same server that hosts our web server (nginx) and our db (mysql). Since not all of our pages are created from the database, it's been suggested that we have a crawler that can crawl the site, and toss the page url and data into mysql and have sphinx index on that. Does anyone know of an open source spider that has a mysql storing option out of the box. Thanks.

Read the article

Baidu spider is hammering my server and bloating my error_log file

- by Gravy

I am getting the following errors in my /etc/httpd/logs/error_log file [Sun Oct 20 00:04:15 2013] [error] [client 180.76.5.16] File does not exist: /usr/local/apache/htdocs/homes [Sun Oct 20 00:08:31 2013] [error] [client 180.76.5.113] File does not exist: /usr/local/apache/htdocs/homes [Sun Oct 20 00:12:47 2013] [error] [client 180.76.5.88] File does not exist: /usr/local/apache/htdocs/homes [Sun Oct 20 00:17:07 2013] [error] [client 180.76.5.138] File does not exist: /usr/local/apache/htdocs/homes These kinds of errors are so often, that my error log files are over 500MB! I have done an IP trace on the client address to find that it belongs to something called baidu. Beijing Baidu Netcom Science and Technology Co in China. Is there a way that I can just get apache to deny any incoming requests from some crummy spider that is repeatedly hitting my site??? Is there a better way of dealing with the problem? I am happy to completely block out China if it means that I can actually track real errors.

Read the article

Website crawler/spider to get site map

- by ack__

I need to retrieve a whole website map, in a format like : http://example.org/ http://example.org/product/ http://example.org/service/ http://example.org/about/ http://example.org/product/viewproduct/ I need it to be linked-based (no file or dir brute-force), like : parse homepage - retrieve all links - explore them - retrieve links, ... And I also need the ability to detect if a page is a "template" to not retrieve all of the "child-pages". For example if the following links are found : http://example.org/product/viewproduct?id=1 http://example.org/product/viewproduct?id=2 http://example.org/product/viewproduct?id=3 I need to get only once the http://example.org/product/viewproduct I've looked into HTTtracks, wget (with spider-option), but nothing conclusive so far. The soft/tool should be downloadable, and I prefer if it runs on Linux. It can be written in any language. Thanks

Read the article

Spider a Website and Return URLs Only

- by Rob Wilkerson

I'm not quite sure how best to define/articulate this, but I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this idea with Wget using the --spider option, but when piping that output through a grep, I can't seem to find the right magic to make it work: wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:' The grep filter seems to have absolutely no affect on the wget output. Have I got something wrong or is there another tool I should try that's more geared towards providing this kind of limited result set? Thanks. UPDATE So I just found out offline that, by default, wget writes to stderr. I missed that in the man pages (in fact, I still haven't found it if it's in there). Once I piped the return to stdout, I got closer to what I need: wget --spider --force-html -r -l1 http://somesite.com 2>&1 | grep 'Saving to:' I'd still be interested in other/better means for doing this kind of thing, if any exist.

Read the article

Desktop Fun: Desert Area Wallpapers

- by Asian Angel

Does the sparse open look of desert areas appeal to your sense of adventure and romance? Then sit back and enjoy the scenic views with our Desert Area Wallpaper collection. Note: Click on the picture to see the full-size image—these wallpapers vary in size so you may need to crop, stretch, or place them on a colored background in order to best match them to your screen’s resolution. For more fun wallpapers be certain to visit our new Desktop Fun section. Similar Articles Productive Geek Tips Windows 7 Welcome Screen Taking Forever? Here’s the Fix (Maybe)Desktop Fun: Starship Theme WallpapersDesktop Fun: Underwater Theme WallpapersDesktop Fun: Starscape Theme WallpapersDesktop Fun: Fantasy Theme Wallpapers TouchFreeze Alternative in AutoHotkey The Icy Undertow Desktop Windows Home Server – Backup to LAN The Clear & Clean Desktop Use This Bookmarklet to Easily Get Albums Use AutoHotkey to Assign a Hotkey to a Specific Window Latest Software Reviews Tinyhacker Random Tips DVDFab 6 Revo Uninstaller Pro Registry Mechanic 9 for Windows PC Tools Internet Security Suite 2010 Get Wildlife Photography Tips at BBC’s PhotoMasterClasses Mashpedia is a Real-time Encyclopedia Playing Games In Chrome Made Easier Stop In The Name Of Love (Firefox addon) Chitika iPad Labs Gives Live iPad Sale Stats Heaven & Hell Finder Icon

Read the article

Getting Started with Python: Attribute Error

- by Nacari

I am new to python and just downloaded it today. I am using it to work on a web spider, so to test it out and make sure everything was working, I downloaded a sample code. Unfortunately, it does not work and gives me the error: "AttributeError: 'MyShell' object has no attribute 'loaded' " I am not sure if the code its self has an error or I failed to do something correctly when installing python. Is there anything you have to do when installing python like adding environmental variables, etc.? And what does that error generally mean? Here is the sample code I used with imported spider class: import chilkat spider = chilkat.CkSpider() spider.Initialize("www.chilkatsoft.com") spider.AddUnspidered("http://www.chilkatsoft.com/") for i in range(0,10): success = spider.CrawlNext() if (success == True): print spider.lastUrl() else: if (spider.get_NumUnspidered() == 0): print "No more URLs to spider" else: print spider.lastErrorText() # Sleep 1 second before spidering the next URL. spider.SleepMs(1000)

Read the article

How does bing-bot( is that the right spider-name? ) and googlebot interpret 301 redirect?

- by jbcurtin

I've been looking for documentation on how the Microsoft and Google bots interpret 301 redirects. It seems that google-bot stores documents on a url based index system. But I haven't been able to figure out how bing works. Should I assume that they are still working towards coping everyone else and assume they use an algorithm close to google? Is it best to just forward a page to a new location via Javascript? I think this might be a blackhat trick, but how would I tell the bots that it's not? Is 301 redirect my best option and I just have to bit the bullet because said pages are no longer in existence? What other options do I have that I might not be aware of?

Read the article

How to create a web crawler/spider/robot?

- by Chris

Is there a way to make a web robot like websiteoutlook.com does? I need something that searches the internet for URLs only...I don't need links, descriptions, etc. What is the best way to do this without getting too technical? I guess it could even be a cronjob that runs a PHP script grabbing URLs from Google, or is there a better way? A simple example or a link to more information would be much appreciated.

Read the article

Scrapy Could not find spider Error

- by Nacari

I have been trying to get a simple spider to run with scrapy, but keep getting the error: Could not find spider for domain:stackexchange.com when I run the code with the expression scrapy-ctl.py crawl stackexchange.com. The spider is as follow: from scrapy.spider import BaseSpider from __future__ import absolute_import class StackExchangeSpider(BaseSpider): domain_name = "stackexchange.com" start_urls = [ "http://www.stackexchange.com/", ] def parse(self, response): filename = response.url.split("/")[-2] open(filename, 'wb').write(response.body) SPIDER = StackExchangeSpider()` Another person posted almost the exact same problem months ago but did not say how they fixed it, http://stackoverflow.com/questions/1806990/scrapy-spider-is-not-working I have been following the turtorial exactly at http://doc.scrapy.org/intro/tutorial.html, and cannot figure out why it is not working.

Read the article

Skynet Big Data Demo Using Hexbug Spider Robot, Raspberry Pi, and Java SE Embedded (Part 4)

- by hinkmond

Here's the first sign of life of a Hexbug Spider Robot converted to become a Skynet Big Data model T-1. Yes, this is T-1 the precursor to the Cyberdyne Systems T-101 (and you know where that will lead to...) It is demonstrating a heartbeat using a simple Java SE Embedded program to drive it. See: Skynet Model T-1 Heartbeat It's alive!!! Well, almost alive. At least there's a pulse. We'll program more to its actions next, and then finally connect it to Skynet Big Data to do more advanced stuff, like hunt for Sara Connor. Java SE Embedded programming makes it simple to create the first model in the long line of T-XXX robots to take on the world. Raspberry Pi makes connecting it all together on one simple device, easy. Next post, I'll show how the wires are connected to drive the T-1 robot. Hinkmond

Read the article

Resolving NameErrors -- Getting NameError on RAILS_END in rails_end.rb When using desert plugin and

- by dr

What's an effective approach to debug NameErrors in Rails? I'm trying to use the desert plugin (0.5.0) and the edge version of Community_Engine. I've started from scratch and gone through the installation instructions. When I attempt to start my server, I get this error: "Constant RAILS_END from rails_end.rb not found (NameError)". Problem is I cannot find rails_end.rb, nor can I find a google reference to this file or error. I've verified that the required gems are installed and current. I've dug around google and desert, but haven't found any reference to this constant. Any ideas? Thanks Here's my stack trace: => Booting Mongrel => Rails 2.3.2 application starting on http://0.0.0.0:3000 /opt/local/lib/ruby/gems/1.8/gems/desert-0.5.0/lib/desert/rails/ dependencies.rb:15:in `load_missing_constant': Constant RAILS_END from rails_end.rb not found (NameError) from /Users/dmr/.gem/ruby/1.8/gems/activesupport-2.3.2/lib/ active_support/dependencies.rb:80:in `const_missing' from /Users/dmr/.gem/ruby/1.8/gems/activesupport-2.3.2/lib/ active_support/dependencies.rb:92:in `const_missing' from /Users/dmr/dev/lionfold/config/environment.rb:32 from /Users/dmr/.gem/ruby/1.8/gems/rails-2.3.2/lib/ initializer.rb:111:in `run' from /Users/dmr/dev/myapp/config/environment.rb:31 from /opt/local/lib/ruby/vendor_ruby/1.8/rubygems/ custom_require.rb:31:in `gem_original_require' from /opt/local/lib/ruby/vendor_ruby/1.8/rubygems/ custom_require.rb:31:in `require' from /Users/dmr/.gem/ruby/1.8/gems/activesupport-2.3.2/lib/ active_support/dependencies.rb:156:in `require' from /Users/dmr/.gem/ruby/1.8/gems/activesupport-2.3.2/lib/ active_support/dependencies.rb:521:in `new_constants_in' from /Users/dmr/.gem/ruby/1.8/gems/activesupport-2.3.2/lib/ active_support/dependencies.rb:156:in `require' from /Users/dmr/.gem/ruby/1.8/gems/rails-2.3.2/lib/commands/ server.rb:84 from /opt/local/lib/ruby/vendor_ruby/1.8/rubygems/ custom_require.rb:31:in `gem_original_require' from /opt/local/lib/ruby/vendor_ruby/1.8/rubygems/ custom_require.rb:31:in `require' from script/server:3

Read the article

Skynet Big Data Demo Using Hexbug Spider Robot, Raspberry Pi, and Java SE Embedded (Part 3)

- by hinkmond

In Part 2, I described what connections you need to make for this demo using a Hexbug Spider Robot, a Raspberry Pi, and Java SE Embedded for programming. Here are some photos of me doing the soldering. Software engineers should not be afraid of a little soldering work. It's all good. See: Skynet Big Data Demo (Part 2) One thing to watch out for when you open the remote is that there may be some glue covering the contact points. Make sure to use an Exacto knife or small screwdriver to scrape away any glue or non-conductive material covering each place where you need to solder. And after you are done with your soldering and you gave the solder enough time to cool, make sure all your connections are marked so that you know which wire goes where. Give each wire a very light tug to make sure it is soldered correctly and is making good contact. There are lots of videos on the Web to help you if this is your first time soldering. Check out Laday Ada's (from adafruit.com) links on how to solder if you need some additional help: http://www.ladyada.net/learn/soldering/thm.html If everything looks good, zip everything back up and meet back here for how to connect these wires to your Raspberry Pi. That will be it for the hardware part of this project. See, that wasn't so bad. Hinkmond

Read the article

Scrapy domain_name for spider

- by Zeynel

From the Scrapy tutorial: domain_name: identifies the Spider. It must be unique, that is, you can’t set the same domain name for different Spiders. Does this mean that domain_name must be a valid domain name, like domain_name = 'example.com' Or can I name domain_name = 'ex1' The problem is I had a spider that worked with domain name domain_name = 'whitecase.com' Now I created a new spider as an instance of CrawlSpider and named it domain_name = 'wc2' but I am getting the error "could not find spider for domain "wc2""

Read the article

Integrate test process with Rspec and Cucumber in a plugin using Desert

- by Romain Endelin

Hello, I'm developing some rails plugins with desert, and I would like to use tests with RSpec and Cucumber in the development. RSpec is integrated by default in Desert plugins, but it gives me some bugs. Finally I have created a minimal application which use my plugin, and I try to test it like a normal application without plugin. But I dislike that solution, the plugin lost its modularity, and RSpec still returns me some bugs (he can't find the models/views/controllers located in the plugin...). So am I the only one who try to test a Desert plugin (or even in a classic plugin) ? May anyone know those matters and get a solution ?

Read the article

Is there anyway of making json data readable by a Google spider?

- by leeand00

Is it possible to make JSON data readable by a Google spider? Say for instance that I have a JSON feed that contains the data for an e-commerce site. This JSON data is used to populate a human-readable page in the users browser. (I.E. The translation from JSON data to human displayed page is done inside the users browser; not my choice, just what I've been given to work with, its an old legacy CGI application and not an actual server-side scripting language.) My concern here is that, the google spiders will not be able to pickup/directly link to the item in question when a user clicks on it in google, being presented with an index page full of all the items, rather than being linked directly to the item they clicked on. Is there anyway of "informing" the google spider in the JSON that what they should feed the user a different link?

Read the article

What is the best way to archive (spider) a site that is going to be removed?

- by Guy

Three different blogs that I read have recently announced that they are going to be discontinued and removed from the web. Although the archived pages will probably be in Google's cache for a few weeks after they've gone and some of the pages will be in the Way Back Machine I'd like to archive those sites to my hard disk for future reference. What is the best way to do this? Is there any software that transforms a blog (e.g. Blogspot) into a chronological PDF?

Read the article

Code Golf: Spider webs

- by LiraNuna

The challenge The shortest code by character count to output a spider web with rings equal to user's input. A spider web is started by reconstructing the center ring: \_|_/ _/ \_ \___/ / | \ Then adding rings equal to the amount entered by the user. A ring is another level of a "spider circles" made from \ / | and _, and wraps the center circle. Input is always guaranteed to be a single positive integer. Test cases Input 1 Output \__|__/ /\_|_/\ _/_/ \_\_ \ \___/ / \/_|_\/ / | \ Input 4 Output \_____|_____/ /\____|____/\ / /\___|___/\ \ / / /\__|__/\ \ \ / / / /\_|_/\ \ \ \ _/_/_/_/_/ \_\_\_\_\_ \ \ \ \ \___/ / / / / \ \ \ \/_|_\/ / / / \ \ \/__|__\/ / / \ \/___|___\/ / \/____|____\/ / | \ Input: 7 Output: \________|________/ /\_______|_______/\ / /\______|______/\ \ / / /\_____|_____/\ \ \ / / / /\____|____/\ \ \ \ / / / / /\___|___/\ \ \ \ \ / / / / / /\__|__/\ \ \ \ \ \ / / / / / / /\_|_/\ \ \ \ \ \ \ _/_/_/_/_/_/_/_/ \_\_\_\_\_\_\_\_ \ \ \ \ \ \ \ \___/ / / / / / / / \ \ \ \ \ \ \/_|_\/ / / / / / / \ \ \ \ \ \/__|__\/ / / / / / \ \ \ \ \/___|___\/ / / / / \ \ \ \/____|____\/ / / / \ \ \/_____|_____\/ / / \ \/______|______\/ / \/_______|_______\/ / | \ Code count includes input/output (i.e full program).

Read the article

how does spider in a search engine works?

- by Niraj CHoubey

How does crawler or spider in a search engine works

Read the article

Scrapy spider is not working

- by Zeynel

Since nothing so far is working I started a new project with python scrapy-ctl.py startproject Nu I followed the tutorial exactly, and created the folders, and a new spider from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.selector import HtmlXPathSelector from scrapy.item import Item from Nu.items import NuItem from urls import u class NuSpider(CrawlSpider): domain_name = "wcase" start_urls = ['http://www.whitecase.com/aabbas/'] names = hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+') u = names.pop() rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),) def parse(self, response): self.log('Hi, this is an item page! %s' % response.url) hxs = HtmlXPathSelector(response) item = Item() item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)') return item SPIDER = NuSpider() and when I run C:\Python26\Scripts\Nu>python scrapy-ctl.py crawl wcase I get [Nu] ERROR: Could not find spider for domain: wcase The other spiders at least are recognized by Scrapy, this one is not. What am I doing wrong? Thanks for your help!

Read the article

Google Analytics: Spider Image

- by Fuxi

hi all, i'd like to have google analytics also spider images (eg. livecams) - i mean it should directly spider how much a certain .jpg was loaded. is this possible? thx

Read the article

How do I block a user-agent from Apache

- by rubo77

How do I realize a UA string block by regular expression in the config files of my Apache webserver? For example: if I would like to block out all bots from Apache on my debian server, that have the regular expression /\b\w+[Bb]ot\b/ or /Spider/ in their user-agent. Those bots should not be able to see any page on my server and they should not appear neither in the accesslogs nor in the errorlogs. http://global-security.blogspot.de/2009/06/how-to-block-robots-before-they-hit.html supposes to uses mod_security for that, but isn't there a simple directive for http.conf?

Read the article

building Mozilla Spider Monkey on Ubuntu

- by Hussain

Hi. I'm trying to build spider monkey on ubuntu 10.04 (lucid). However, when I run autoconf2.13 on the js/src directory, it tells me there is no configure.in file. I can't just do the usual ./configure make sudo make install , either. What's up with it?

Read the article

Cron job on Plesk command line options (spider)

- by Paulo Bueno

Hi guys, My customer is setting up a cron task via plesk. In pycron there's an option called spider that prevents the cron to save a copy of the file on the server root. Does anyone knows wich are the command line options accepted by a cron task setted trought plesk? thx.

Read the article

SphinxSearch or a spider - which one to choose?

- by r2b2

Hello, here is my problem: We own SiteA and SiteB and they share the same server and database where we have full control. SiteC , siteD and siteE are some of the sites we own as well but reside on a different web hosts. The goal is to create a unified search functionality for all of the sites mentioned above. That is if somebody search for a term in SiteA, the search result will automatically come up with results from SiteB,SiteC,SiteD and Site E too. The search results should be shown under the website they were found in. All these websites content are stored in their own databases. If I use SphinxSearch to index the above sites,I would then require those sites that we dont have complete control with to setup a web service where i can download a database dump or csv file for indexing. Im not quite sure about how a sphider will come into play here so need your opinion. Sphinx or a spider? THanks!

Search Results

Search found 240 results on 10 pages for 'desert spider'.

Page 1/10 | 1 2 3 4 5 6 7 8 9 10 | Next Page >

- by Nacari

- by ian.evans

- by Gravy

- by ack__

- by Rob Wilkerson

- by Asian Angel

- by Nacari

- by jbcurtin

- by Chris

- by Nacari

- by hinkmond

- by dr

- by hinkmond

- by Zeynel

- by Romain Endelin

- by leeand00

- by Guy

- by LiraNuna

- by Niraj CHoubey

- by Zeynel

- by Fuxi

- by rubo77

- by Hussain

- by Paulo Bueno

- by r2b2

1 2 3 4 5 6 7 8 9 10 | Next Page >