Search Results

Search found 1838 results on 74 pages for 'miss spider'.


Creating a spider using Scrapy, Spider generation error.

- by Nacari

I just downloaded Scrapy (web crawler) on Windows 32 and have just created a new project folder using the "scrapy-ctl.py startproject dmoz" command in DOS. I then proceeded to create the first spider using the command: scrapy-ctl.py genspider myspider myspdier-domain.com but it did not work and returned the error: Error running: scrapy-ctl.py genspider, Cannot find project settings module in python path: scrapy_settings. I know I have the path set right (to python26/scripts), but I am having difficulty figuring out what the problem is. I am new to both Scrapy and Python, so there is a good possibility that I have failed to do something important. Also, I have been using Eclipse with the PyDev plugin to edit the code, in case that might cause some problems.

Site crawler/spider that tosses results into mysql

- by ian.evans

It's been suggested that we use mysql for our site's search as it'd be running on the same server that hosts our web server (nginx) and our db (mysql). Since not all of our pages are created from the database, it's been suggested that we have a crawler that can crawl the site, toss the page URL and data into mysql, and have sphinx index on that. Does anyone know of an open source spider that has a mysql storing option out of the box? Thanks.

Baidu spider is hammering my server and bloating my error_log file

- by Gravy

I am getting the following errors in my /etc/httpd/logs/error_log file:

[Sun Oct 20 00:04:15 2013] [error] [client 180.76.5.16] File does not exist: /usr/local/apache/htdocs/homes
[Sun Oct 20 00:08:31 2013] [error] [client 180.76.5.113] File does not exist: /usr/local/apache/htdocs/homes
[Sun Oct 20 00:12:47 2013] [error] [client 180.76.5.88] File does not exist: /usr/local/apache/htdocs/homes
[Sun Oct 20 00:17:07 2013] [error] [client 180.76.5.138] File does not exist: /usr/local/apache/htdocs/homes

These errors are so frequent that my error log files are over 500MB! I have done an IP trace on the client address and found that it belongs to something called Baidu (Beijing Baidu Netcom Science and Technology Co, in China). Is there a way that I can just get Apache to deny any incoming requests from some crummy spider that is repeatedly hitting my site? Is there a better way of dealing with the problem? I am happy to completely block out China if it means that I can actually track real errors.
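Before blocking anything, it can help to tally which clients are producing the errors. The following is a minimal Python 3 sketch, assuming the Apache 2.2-style line format quoted in the question; the function names `count_clients` and `count_prefixes` and the sample list are hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# Matches the "[client 180.76.5.16]" field of an Apache 2.2-style error line.
CLIENT_RE = re.compile(r"\[client (\d{1,3}(?:\.\d{1,3}){3})\]")

def count_clients(lines):
    """Return a Counter mapping client IP -> number of error-log lines."""
    counts = Counter()
    for line in lines:
        match = CLIENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def count_prefixes(ip_counts):
    """Group per-IP counts by their first three octets (a rough /24 view)."""
    prefixes = Counter()
    for ip, n in ip_counts.items():
        prefixes[ip.rsplit(".", 1)[0]] += n
    return prefixes

# Sample data: the four lines quoted in the question.
sample = [
    "[Sun Oct 20 00:04:15 2013] [error] [client 180.76.5.16] File does not exist: /usr/local/apache/htdocs/homes",
    "[Sun Oct 20 00:08:31 2013] [error] [client 180.76.5.113] File does not exist: /usr/local/apache/htdocs/homes",
    "[Sun Oct 20 00:12:47 2013] [error] [client 180.76.5.88] File does not exist: /usr/local/apache/htdocs/homes",
    "[Sun Oct 20 00:17:07 2013] [error] [client 180.76.5.138] File does not exist: /usr/local/apache/htdocs/homes",
]

ip_counts = count_clients(sample)
prefix_counts = count_prefixes(ip_counts)
for prefix, n in prefix_counts.most_common():
    print(prefix, n)
```

On a real log, the list would be replaced by the lines of the error_log file; the per-prefix totals show whether a single address range accounts for most of the noise.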

Website crawler/spider to get site map

- by ack__

I need to retrieve a whole website map, in a format like:

http://example.org/
http://example.org/product/
http://example.org/service/
http://example.org/about/
http://example.org/product/viewproduct/

I need it to be link-based (no file or directory brute-forcing), like: parse the homepage, retrieve all links, explore them, retrieve their links, and so on. I also need the ability to detect whether a page is a "template", so as not to retrieve all of its "child pages". For example, if the following links are found:

http://example.org/product/viewproduct?id=1
http://example.org/product/viewproduct?id=2
http://example.org/product/viewproduct?id=3

I need to get http://example.org/product/viewproduct only once. I've looked into HTTrack and wget (with the --spider option), but nothing conclusive so far. The tool should be downloadable, and I would prefer that it runs on Linux. It can be written in any language. Thanks
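The "template" requirement can be approximated by treating URLs that differ only in their query string as the same page. The Python 3 sketch below illustrates that heuristic only; it is an assumption about what counts as a template, not a feature of HTTrack or wget, and the name `collapse_templates` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def collapse_templates(urls):
    """Drop query strings and fragments, keeping each resulting URL once,
    in first-seen order."""
    seen = set()
    result = []
    for url in urls:
        parts = urlsplit(url)
        # Rebuild the URL from scheme, host and path only.
        base = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if base not in seen:
            seen.add(base)
            result.append(base)
    return result

# Links as they might be collected while crawling (taken from the question).
found = [
    "http://example.org/product/viewproduct?id=1",
    "http://example.org/product/viewproduct?id=2",
    "http://example.org/product/viewproduct?id=3",
    "http://example.org/about/",
]

site_map = collapse_templates(found)
for url in site_map:
    print(url)
```

A crawler would apply the same normalisation before queueing a link, so that each template is fetched only once.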

Spider a Website and Return URLs Only

- by Rob Wilkerson

I'm not quite sure how best to define/articulate this, but I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this idea with Wget using the --spider option, but when piping that output through a grep, I can't seem to find the right magic to make it work:

wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:'

The grep filter seems to have absolutely no effect on the wget output. Have I got something wrong or is there another tool I should try that's more geared towards providing this kind of limited result set? Thanks.

UPDATE: So I just found out offline that, by default, wget writes to stderr. I missed that in the man pages (in fact, I still haven't found it if it's in there). Once I redirected stderr to stdout, I got closer to what I need:

wget --spider --force-html -r -l1 http://somesite.com 2>&1 | grep 'Saving to:'

I'd still be interested in other/better means for doing this kind of thing, if any exist.
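The stderr behaviour described in the update can be reproduced without wget. The Python 3 sketch below uses a small child process as a stand-in for any tool that writes its messages to stderr; the child command and the 'Saving to: index.html' text are assumptions made for the demonstration, not real wget output.

```python
import subprocess
import sys

# Stand-in for a tool that writes its progress messages to stderr.
child = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('Saving to: index.html\\n')",
]

# Capturing only stdout sees nothing, so a filter on the pipe has no input.
stdout_only = subprocess.run(
    child, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True
).stdout

# stderr=subprocess.STDOUT plays the role of the shell's 2>&1.
merged = subprocess.run(
    child, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
).stdout

# Equivalent of: ... 2>&1 | grep 'Saving to:'
matches = [line for line in merged.splitlines() if "Saving to:" in line]
print(repr(stdout_only))
print(matches)
```

Only the second run gives the filter anything to match, which mirrors the difference between the two shell pipelines in the question.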

Getting Started with Python: Attribute Error

- by Nacari

I am new to Python and just downloaded it today. I am using it to work on a web spider, so to test it out and make sure everything was working, I downloaded some sample code. Unfortunately, it does not work and gives me the error: "AttributeError: 'MyShell' object has no attribute 'loaded'". I am not sure whether the code itself has an error or I failed to do something correctly when installing Python. Is there anything you have to do when installing Python, like adding environment variables, etc.? And what does that error generally mean? Here is the sample code I used, with the imported spider class:

    import chilkat

    spider = chilkat.CkSpider()
    spider.Initialize("www.chilkatsoft.com")
    spider.AddUnspidered("http://www.chilkatsoft.com/")

    for i in range(0,10):
        success = spider.CrawlNext()
        if (success == True):
            print spider.lastUrl()
        else:
            if (spider.get_NumUnspidered() == 0):
                print "No more URLs to spider"
            else:
                print spider.lastErrorText()
        # Sleep 1 second before spidering the next URL.
        spider.SleepMs(1000)
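On the general meaning of the error: an AttributeError is raised when code reads an attribute that was never set on that object. The Python 3 sketch below reproduces the same message with a hypothetical `MyShell` class; the real class from the question is not shown, so this illustrates the error itself and is not a diagnosis of that code.

```python
class MyShell:
    """Hypothetical stand-in; the real MyShell class is not shown."""

    def __init__(self):
        self.name = "shell"  # note: no 'loaded' attribute is ever assigned

shell = MyShell()

try:
    shell.loaded  # reading a missing attribute raises AttributeError
except AttributeError as exc:
    message = str(exc)
    print(message)

# getattr with a default avoids the exception when an attribute may be absent.
loaded = getattr(shell, "loaded", False)
print(loaded)
```

The usual remedy is to make sure the attribute is assigned (typically in `__init__`) before any code reads it.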

How does bing-bot( is that the right spider-name? ) and googlebot interpret 301 redirect?

- by jbcurtin

I've been looking for documentation on how the Microsoft and Google bots interpret 301 redirects. It seems that Googlebot stores documents in a URL-based index system, but I haven't been able to figure out how Bing works. Should I assume that they are still working towards copying everyone else, and that they use an algorithm close to Google's? Is it best to just forward a page to a new location via JavaScript? I think this might be a blackhat trick, but how would I tell the bots that it's not? Is a 301 redirect my best option, and do I just have to bite the bullet because said pages are no longer in existence? What other options do I have that I might not be aware of?

How to create a web crawler/spider/robot?

- by Chris

Is there a way to make a web robot like websiteoutlook.com does? I need something that searches the internet for URLs only...I don't need links, descriptions, etc. What is the best way to do this without getting too technical? I guess it could even be a cronjob that runs a PHP script grabbing URLs from Google, or is there a better way? A simple example or a link to more information would be much appreciated.

Scrapy Could not find spider Error

- by Nacari

I have been trying to get a simple spider to run with Scrapy, but keep getting the error: Could not find spider for domain:stackexchange.com when I run the code with the command scrapy-ctl.py crawl stackexchange.com. The spider is as follows:

    from scrapy.spider import BaseSpider
    from __future__ import absolute_import

    class StackExchangeSpider(BaseSpider):
        domain_name = "stackexchange.com"
        start_urls = [
            "http://www.stackexchange.com/",
        ]

        def parse(self, response):
            filename = response.url.split("/")[-2]
            open(filename, 'wb').write(response.body)

    SPIDER = StackExchangeSpider()

Another person posted almost the exact same problem months ago but did not say how they fixed it: http://stackoverflow.com/questions/1806990/scrapy-spider-is-not-working I have been following the tutorial exactly at http://doc.scrapy.org/intro/tutorial.html, and cannot figure out why it is not working.

Skynet Big Data Demo Using Hexbug Spider Robot, Raspberry Pi, and Java SE Embedded (Part 4)

- by hinkmond

Here's the first sign of life of a Hexbug Spider Robot converted to become a Skynet Big Data model T-1. Yes, this is T-1, the precursor to the Cyberdyne Systems T-101 (and you know where that will lead...) It is demonstrating a heartbeat using a simple Java SE Embedded program to drive it. See: Skynet Model T-1 Heartbeat It's alive!!! Well, almost alive. At least there's a pulse. We'll program more of its actions next, and then finally connect it to Skynet Big Data to do more advanced stuff, like hunt for Sarah Connor. Java SE Embedded programming makes it simple to create the first model in the long line of T-XXX robots to take on the world. Raspberry Pi makes it easy to connect it all together on one simple device. Next post, I'll show how the wires are connected to drive the T-1 robot. Hinkmond

Is there a way I can see why Squid (Proxy Server) determines that a resource should be a MISS?

- by Pure.Krome

Hi folks, I'm using Fiddler/Firebug to debug some of our live server web content. We're getting a lot of:

X-Cache: MISS from
X-Cache-Lookup: MISS from :8080
Via: 1.1 :8080 (squid/2.7.STABE3)

I thought I knew a lot about cache-control / expires / last-modified / etags, etc., but maybe not. So, is there a way I can run Squid in some verbose mode to see why it thinks a resource I request is or is not being cached, and therefore why we're getting MISSes back? Cheers :)

Skynet Big Data Demo Using Hexbug Spider Robot, Raspberry Pi, and Java SE Embedded (Part 3)

- by hinkmond

In Part 2, I described what connections you need to make for this demo using a Hexbug Spider Robot, a Raspberry Pi, and Java SE Embedded for programming. Here are some photos of me doing the soldering. Software engineers should not be afraid of a little soldering work. It's all good. See: Skynet Big Data Demo (Part 2) One thing to watch out for when you open the remote is that there may be some glue covering the contact points. Make sure to use an Exacto knife or small screwdriver to scrape away any glue or non-conductive material covering each place where you need to solder. After you are done with your soldering and have given the solder enough time to cool, make sure all your connections are marked so that you know which wire goes where. Give each wire a very light tug to make sure it is soldered correctly and is making good contact. There are lots of videos on the Web to help you if this is your first time soldering. Check out Lady Ada's (from adafruit.com) links on how to solder if you need some additional help: http://www.ladyada.net/learn/soldering/thm.html If everything looks good, zip everything back up and meet back here for how to connect these wires to your Raspberry Pi. That will be it for the hardware part of this project. See, that wasn't so bad. Hinkmond

Scrapy domain_name for spider

- by Zeynel

From the Scrapy tutorial: domain_name: identifies the Spider. It must be unique, that is, you can’t set the same domain name for different Spiders. Does this mean that domain_name must be a valid domain name, like domain_name = 'example.com'? Or can I use a name like domain_name = 'ex1'? The problem is that I had a spider that worked with domain_name = 'whitecase.com'. Now I have created a new spider as an instance of CrawlSpider and named it domain_name = 'wc2', but I am getting the error "could not find spider for domain "wc2"".

Is there any way of making JSON data readable by a Google spider?

- by leeand00

Is it possible to make JSON data readable by a Google spider? Say, for instance, that I have a JSON feed that contains the data for an e-commerce site. This JSON data is used to populate a human-readable page in the user's browser. (I.e., the translation from JSON data to the displayed page is done inside the user's browser; not my choice, just what I've been given to work with. It's an old legacy CGI application and not an actual server-side scripting language.) My concern here is that the Google spiders will not be able to pick up or link directly to the item in question, so that when a user clicks on it in Google, they are presented with an index page full of all the items rather than being linked directly to the item they clicked on. Is there any way of "informing" the Google spider, in the JSON, that it should feed the user a different link?

Don't Miss the Oracle Virtual Tradeshow - Spotlight on Real-World Customer Success Feb 3rd

- by jay.richey

Hear from over 20 organizations like yours who are enjoying the benefits of the latest releases of Oracle Applications. Agility will talk about their upgrade to E-Business Suite HCM 12.1 and Ernest Health will highlight the benefits of their upgrade to PeopleSoft HCM 9.1. Plus don't miss the session with Gretchen Alarcon discussing Fusion HCM and how it will co-exist with your current E-Business Suite or PeopleSoft HCM system and strategy. If you are considering an upgrade or are in process of evaluating additional solutions, this is an event you don't want to miss.... February 3, 2011, 8:00 am to 1:00 pm PST View the agenda and register for this online event here.

What is the best way to archive (spider) a site that is going to be removed?

- by Guy

Three different blogs that I read have recently announced that they are going to be discontinued and removed from the web. Although the archived pages will probably be in Google's cache for a few weeks after they've gone, and some of the pages will be in the Wayback Machine, I'd like to archive those sites to my hard disk for future reference. What is the best way to do this? Is there any software that transforms a blog (e.g. Blogspot) into a chronological PDF?

Code Golf: Spider webs

- by LiraNuna

The challenge The shortest code by character count to output a spider web with rings equal to user's input. A spider web is started by reconstructing the center ring: \_|_/ _/ \_ \___/ / | \ Then adding rings equal to the amount entered by the user. A ring is another level of a "spider circles" made from \ / | and _, and wraps the center circle. Input is always guaranteed to be a single positive integer. Test cases Input 1 Output \__|__/ /\_|_/\ _/_/ \_\_ \ \___/ / \/_|_\/ / | \ Input 4 Output \_____|_____/ /\____|____/\ / /\___|___/\ \ / / /\__|__/\ \ \ / / / /\_|_/\ \ \ \ _/_/_/_/_/ \_\_\_\_\_ \ \ \ \ \___/ / / / / \ \ \ \/_|_\/ / / / \ \ \/__|__\/ / / \ \/___|___\/ / \/____|____\/ / | \ Input: 7 Output: \________|________/ /\_______|_______/\ / /\______|______/\ \ / / /\_____|_____/\ \ \ / / / /\____|____/\ \ \ \ / / / / /\___|___/\ \ \ \ \ / / / / / /\__|__/\ \ \ \ \ \ / / / / / / /\_|_/\ \ \ \ \ \ \ _/_/_/_/_/_/_/_/ \_\_\_\_\_\_\_\_ \ \ \ \ \ \ \ \___/ / / / / / / / \ \ \ \ \ \ \/_|_\/ / / / / / / \ \ \ \ \ \/__|__\/ / / / / / \ \ \ \ \/___|___\/ / / / / \ \ \ \/____|____\/ / / / \ \ \/_____|_____\/ / / \ \/______|______\/ / \/_______|_______\/ / | \ Code count includes input/output (i.e full program).

How does a spider in a search engine work?

- by Niraj CHoubey

How does a crawler or spider in a search engine work?

DON'T MISS THE ORACLE LINUX GENERAL SESSION @ORACLE OPENWORLD

- by Zeynep Koch

We have had great sessions today at OpenWorld, but tomorrow will be even better. The session that you should not miss is:

Tuesday, Oct 2nd:

General Session: Oracle Linux Strategy and Roadmap, 10:15am, Moscone South #103
Wim Coekaerts, Sr. VP, Oracle Linux and Virtualization Engineering, will talk about the Oracle Linux strategy and what is coming in the next 12 months. This is one session you should not miss, and people are already registering. Stop by to hear Wim and ask questions about Linux development.

Top Technical Tips for Automatic and Secure Oracle Linux Deployments, 11:45am, Moscone South #270
In this session, you will hear about deployment best practices and tips from Lenz Grimmer from Oracle, and two Linux customers, Martin Breslin from SEI and Ed Bailey from Transunion, will talk about their experiences and insights.

Why Switch to Oracle Linux?, 3:30pm, Moscone South #270
In this session you will learn why Oracle Linux is best for your enterprise. There will be an Oracle speaker, and Mike Radomski from SUNY will talk about why they chose Oracle Linux.

Please also visit the Oracle Linux Pavilion. If you stop by one of our partners' booths, you can be entered in the drawing for this beautiful, plush penguin. See you all tomorrow.

Don't Miss Oracle UPK at the Oracle Applications Virtual Tradeshow

- by di.seghposs(at)oracle.com

Be sure to visit the Oracle Applications Virtual Tradeshow - Spotlight on Customer Success - February 3, 2011. If you are considering using Oracle UPK for a project or an upgrade, this is an event you don't want to miss. Hear how the City and County of San Francisco used Oracle UPK for their successful PeopleSoft upgrade. Get a chance to meet the experts and listen to 20+ customers share their success with Oracle Applications. Register Now!

The Absolute Key to Google Search Engine Optimization - You Don't Want to Miss This

Getting top Google rankings is something that many webmasters want, but hardly any ever achieve...mainly because of something that they almost all miss. The good news is that if you are able to get one simple thing for your site, you can make your site rank at the top of Google.

The Step Most People Miss in Google Search Engine Optimization

Most people who are involved in Google Search Engine Optimization do all the basics right but there is one small aspect to it that they miss. Read on to know more about it.

Force request to miss cache but still store the response

- by Tom Marthenal

I have a slow web app that I've placed Varnish in front of. All of the pages are static (they don't vary for a different user), but they need to be updated every 5 minutes so they contain recent data. I have a simple script (wget --mirror) that crawls the entire website every 15 minutes. Each crawl takes about 5 minutes. The point of the crawl is to update every page in the Varnish cache so that a user never has to wait for the page to generate (since all pages have been generated recently thanks to the spider). The timeline looks like this:

00:00:00: Cache flushed
00:00:00: Spider starts crawling to update cache with new pages
00:05:00: Spider finishes crawling, all pages are updated until 1:15

A request that comes in between 0:00:00 and 0:05:00 might hit a page that hasn't been updated yet, and will be forced to wait a few seconds for a response. This isn't acceptable. What I'd like to do is, perhaps using some VCL magic, always forward requests from the spider to the backend, but still store the response in the cache. This way, a user will never have to wait for a page to generate, since there is no 5-minute window in which parts of the cache are empty (except perhaps at server startup). How can I do this?

Don't Miss A Session -- Check the Daily Updates!

- by Oracle OpenWorld Blog Team

With thousands of sessions during conference week, sometimes times and locations change. Be sure to check session updates daily so you won't miss a thing. Session updates can be found at the following URLs:

Oracle OpenWorld: http://www.oracle.com/openworld/updates/monday/index.html?origref=http://www.oracle.com/openworld/index.html
JavaOne: http://www.oracle.com/javaone/updates/monday/index.html?origref=http://www.oracle.com/javaone/index.html
Oracle PartnerNetwork Exchange @ OpenWorld: http://www.oracle.com/opnexchange/updates/sunday/index.html?origref=http://www.oracle.com/opnexchange/index.html
Customer Experience Summit @ OpenWorld: http://www.oracle.com/events/us/en/cxsummit/updates/wednesday/index.html?origref=http://www.oracle.com/events/us/en/cxsummit/index.html
Java Embedded @ JavaOne: http://www.oracle.com/javaone/embedded/updates/wednesday/index.html?origref=http://www.oracle.com/javaone/embedded/index.html

Extra Extra Read All About You - Don't Miss the May 15th Deadline

- by Get_Specialized!

Oracle PartnerNetwork (OPN) will be launching a special issue of Profit Magazine with a focus on Specialized partners. This issue, to be released in August 2012, will be a collection of the most innovative partner success stories from our Specialized partners around the world. If you are an Oracle Specialized Partner, you don't want to miss the opportunity to showcase your success story. The story must be completed (written and approved by the customer) before May 15, 2012 to be eligible for this issue. For more details and how to submit, visit http://www.oracle.com/webapps/dialogue/ns/dlgwelcome.jsp?p_ext=Y&p_dlg_id=11542138&src=7325411&Act=47
