Search Results

Search found 29 results on 2 pages for 'scrapy'.


Scrapy spider is not working

- by Zeynel

Since nothing so far is working I started a new project with python scrapy-ctl.py startproject Nu I followed the tutorial exactly, and created the folders, and a new spider from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.selector import HtmlXPathSelector from scrapy.item import Item from Nu.items import NuItem from urls import u class NuSpider(CrawlSpider): domain_name = "wcase" start_urls = ['http://www.whitecase.com/aabbas/'] names = hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+') u = names.pop() rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),) def parse(self, response): self.log('Hi, this is an item page! %s' % response.url) hxs = HtmlXPathSelector(response) item = Item() item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)') return item SPIDER = NuSpider() and when I run C:\Python26\Scripts\Nu>python scrapy-ctl.py crawl wcase I get [Nu] ERROR: Could not find spider for domain: wcase The other spiders at least are recognized by Scrapy, this one is not. What am I doing wrong? Thanks for your help!

Read the article
Creating a spider using Scrapy, Spider generation error.

- by Nacari

I just downloaded Scrapy (web crawler) on 32-bit Windows and have just created a new project folder using the "scrapy-ctl.py startproject dmoz" command in DOS. I then proceeded to create the first spider using the command: scrapy-ctl.py genspider myspider myspdier-domain.com but it did not work and returned the error: Error running: scrapy-ctl.py genspider, Cannot find project settings module in python path: scrapy_settings. I know I have the path set right (to python26/scripts), but I am having difficulty figuring out what the problem is. I am new to both Scrapy and Python, so there is a good possibility that I have failed to do something important. Also, I have been using Eclipse with the PyDev plugin to edit the code, if that might cause some problems.

Read the article
Scrapy Could not find spider Error

- by Nacari

I have been trying to get a simple spider to run with Scrapy, but keep getting the error: Could not find spider for domain:stackexchange.com when I run the code with the expression scrapy-ctl.py crawl stackexchange.com. The spider is as follows: from scrapy.spider import BaseSpider from __future__ import absolute_import class StackExchangeSpider(BaseSpider): domain_name = "stackexchange.com" start_urls = [ "http://www.stackexchange.com/", ] def parse(self, response): filename = response.url.split("/")[-2] open(filename, 'wb').write(response.body) SPIDER = StackExchangeSpider() Another person posted almost the exact same problem months ago but did not say how they fixed it, http://stackoverflow.com/questions/1806990/scrapy-spider-is-not-working I have been following the tutorial exactly at http://doc.scrapy.org/intro/tutorial.html, and cannot figure out why it is not working.

Read the article
Scrapy + Eclipse PyDev : how to setup the debugger?

- by AsTeR

I've successfully set up Eclipse with my Scrapy project. I did it by setting a new Run/Debug configuration: whose main module links to Scrapy, /usr/local/bin/scrapy for me (I've found suggestions to use cmdline.py but that failed on my computer (OS X Lion & Scrapy installed through easy_install)); defining the arguments to send, "crawl ny" in my case, as I would if I used the Scrapy command line; setting the correct working directory (${workspace_loc:My Project/src} in my case). Eclipse can successfully launch my project, but I have no debugger. I'm missing my breakpoints and variable inspection. Does anyone know how to set up the debugger with this environment?

Read the article
how to get entire document in scrapy using hxs.select

- by Chris Smith

I've been at this for 12hrs and I'm hoping someone can give me a leg up. Here is my code; all I want is to get the anchor and URL of every link on a page as it crawls along. from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.selector import HtmlXPathSelector from scrapy.utils.url import urljoin_rfc from scrapy.utils.response import get_base_url from urlparse import urljoin #from scrapy.item import Item from tutorial.items import DmozItem class HopitaloneSpider(CrawlSpider): name = 'dmoz' allowed_domains = ['domain.co.uk'] start_urls = [ 'http://www.domain.co.uk' ] rules = ( #Rule(SgmlLinkExtractor(allow='>example\.org', )), Rule(SgmlLinkExtractor(allow=('\w+$', )), callback='parse_item', follow=True), ) user_agent = 'Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))' def parse_item(self, response): #self.log('Hi, this is an item page! %s' % response.url) hxs = HtmlXPathSelector(response) #print response.url sites = hxs.select('//html') #item = DmozItem() items = [] for site in sites: item = DmozItem() item['title'] = site.select('a/text()').extract() item['link'] = site.select('a/@href').extract() items.append(item) return items What am I doing wrong? My eyes hurt now.
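
One thing that may explain the empty results: evaluated relative to the //html node, the XPath a/text() only matches a elements that are direct children of html, and an ordinary page has none; a descendant query such as //a is the usual way to reach every link. The sketch below illustrates the intended result (one anchor text and one URL per link) with the Python 3 standard library instead of Scrapy. LinkCollector, collect_links and SAMPLE_HTML are hypothetical names made up for this illustration, not part of the original spider.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (anchor text, href) pairs for every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _href as None and are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def collect_links(html):
    """Return a list of (anchor text, href) tuples found anywhere in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page: the links are nested, not direct children of <html>.
SAMPLE_HTML = (
    '<html><body><div><a href="/a">First</a>'
    '<p><a href="/b"> Second </a></p></div></body></html>'
)

if __name__ == "__main__":
    for text, href in collect_links(SAMPLE_HTML):
        print(text, href)
```

In the spider itself, the equivalent change would presumably be to select the a elements at any depth and build one item per element, rather than one item per html node.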

Read the article
Using Scrapy Spider in a CGI Script or Web Framework, ie: Django

- by Israel ANY

How do I use a Python library such as Scrapy in a CGI script or even in a framework such as Django if I need to do so later? Here is the documentation I've consulted thus far, but it doesn't seem to meet the concern I have. http://doc.scrapy.org/topics/spiders.html http://doc.scrapy.org/topics/webconsole.html Critique and suggestions are welcomed!

Read the article
scrapy - python question

- by tom smith

Hi.. Maybe not the correct place to post. But, I'm going to try anyway! I've got a couple of test python parsing scripts that I created. They work enough for me to test what I'm working on. However, I recently came across the python framework, Scrapy, which is used for web scraping. My app runs in a distributed process, across a testbed of multiple servers. I'm trying to understand scrapy, to see if it provides benefits over what I'm doing. So, if possible, I'd really like to talk with a few people who are grounded in/or who use scrapy. Thanks -tom

Read the article
How to loop over nodes with xmlfeed using scrapy python

- by Kour ipm

Hi, I am working with Scrapy and trying XML feeds for the first time. Below is my code: class TestxmlItemSpider(XMLFeedSpider): name = "TestxmlItem" allowed_domains = {"http://www.nasinteractive.com"} start_urls = [ "http://www.nasinteractive.com/jobexport/advance/hcantexasexport.xml" ] iterator = 'iternodes' itertag = 'job' def parse_node(self, response, node): title = node.select('title/text()').extract() job_code = node.select('job-code/text()').extract() detail_url = node.select('detail-url/text()').extract() category = node.select('job-category/text()').extract() print title,";;;;;;;;;;;;;;;;;;;;;" print job_code,";;;;;;;;;;;;;;;;;;;;;" item = TestxmlItem() item['title'] = node.select('title/text()').extract() ....... return item result: File "/usr/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/item.py", line 56, in __setitem__ (self.__class__.__name__, key)) exceptions.KeyError: 'TestxmlItem does not support field: title' There are 200+ items in total, so I need to loop over the nodes and assign each node's text to an item, but all the results are displayed at once when printed. How can I loop over the nodes when scraping XML files with XMLFeedSpider?
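
Two observations, offered tentatively. The KeyError in the traceback says the item class has no declared field named title; Scrapy items only accept keys that were declared as Field attributes on the Item subclass. Separately, XMLFeedSpider calls parse_node once per itertag node, so no explicit loop should be needed inside it. The per-node idea can be sketched outside Scrapy with the Python 3 standard library; SAMPLE_FEED and jobs_from_feed are hypothetical, and the real feed's layout is assumed from the tag names used in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed; only the tag names come from the question.
SAMPLE_FEED = """<jobs>
  <job><title>Nurse</title><job-code>A1</job-code><detail-url>http://example.com/a1</detail-url></job>
  <job><title>Clerk</title><job-code>B2</job-code><detail-url>http://example.com/b2</detail-url></job>
</jobs>"""


def jobs_from_feed(xml_text):
    """Yield one dict per <job> node, reading the text of its child tags."""
    root = ET.fromstring(xml_text)
    for job in root.iter("job"):
        yield {
            "title": job.findtext("title"),
            "job_code": job.findtext("job-code"),
            "detail_url": job.findtext("detail-url"),
        }


if __name__ == "__main__":
    for record in jobs_from_feed(SAMPLE_FEED):
        print(record)
```

Each yielded dict corresponds to what a single parse_node call would return as one item.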

Read the article
Scrapy issue with iTunes' AppStore

- by Eric

I am using Scrapy to fetch some data from iTunes' AppStore database. I start with this list of apps: http://itunes.apple.com/us/genre/mobile-software-applications/id36?mt=8 In the following code I have used the simplest regex which targets all apps in the US store. from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.contrib.spiders import CrawlSpider, Rule class AppStoreSpider(CrawlSpider): domain_name = 'itunes.apple.com' start_urls = ['http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8'] rules = ( Rule(SgmlLinkExtractor(allow='itunes\.apple\.com/us/app'), 'parse_app', follow=True, ), ) def parse_app(self, response): .... SPIDER = AppStoreSpider() When I run it I receive the following: [itunes.apple.com] DEBUG: Crawled (200) <GET http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8> (referer: None) [itunes.apple.com] DEBUG: Filtered offsite request to 'itunes.apple.com': <GET http://itunes.apple.com/us/app/bloomberg/id281941097?mt=8> As you can see, when it starts crawling the first page it says: "Filtered offsite request to 'itunes.apple.com'". and then the spider stops.. it also returns this message: [ScrapyHTTPPageGetter,client] /usr/lib/python2.5/cookielib.py:1577: exceptions.UserWarning: cookielib bug! Traceback (most recent call last): File "/usr/lib/python2.5/cookielib.py", line 1575, in make_cookies parse_ns_headers(ns_hdrs), request) File "/usr/lib/python2.5/cookielib.py", line 1532, in _cookies_from_attrs_set cookie = self._cookie_from_cookie_tuple(tup, request) File "/usr/lib/python2.5/cookielib.py", line 1451, in _cookie_from_cookie_tuple if version is not None: version = int(version) ValueError: invalid literal for int() with base 10: '"1"' I have used the same script for other website and I didn't have this problem. Any suggestion?

Read the article
Scraping paginated items from a website using scrapy

- by Mridang Agarwalla

I'm using Scrapy to scrape items from a site, and I haven't been able to implement the following scraping pattern. The site I'm trying to scrape is a forum and I scrape the site once a day. Each page has a table containing posts. New posts are added to the top of the table, and as more and more posts are posted to the site, the older posts go further into the pages due to pagination. This is a very simple scenario and we will assume that the order of the posts never changes. I would like to scrape this site and scrape all the "new" records until the last scraped post from yesterday is encountered. I have configured my spider to paginate endlessly, and when it encounters yesterday's last scraped post, it should stop. How can I implement this? (My Scrapy installation works with my Django installation using django-dynamic-scraper.)
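
The stopping rule described here can be sketched independently of Scrapy: walk the pages newest-first and stop as soon as the post recorded on the previous run shows up. Everything below is an assumption made for illustration (the id key, the PAGES sample data, and the new_posts and fetch_pages helpers); in a real spider the same check would decide whether to request the next page at all.

```python
def new_posts(pages, last_seen_id):
    """Yield posts newest-first until the previously scraped post is reached."""
    for page in pages:
        for post in page:
            if post["id"] == last_seen_id:
                return  # everything from here on was scraped on an earlier run
            yield post


# Hypothetical forum: three pages of posts, newest first.
PAGES = [
    [{"id": 105}, {"id": 104}],
    [{"id": 103}, {"id": 102}],
    [{"id": 101}, {"id": 100}],
]

fetched = []  # page numbers that were actually requested


def fetch_pages():
    """Lazily produce pages, recording each fetch, to mimic on-demand requests."""
    for number, page in enumerate(PAGES, start=1):
        fetched.append(number)
        yield page


if __name__ == "__main__":
    print([post["id"] for post in new_posts(fetch_pages(), last_seen_id=102)])
    print(fetched)
```

Because the pages come from a generator, the third page is never requested once post 102 is found on the second one.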

Read the article
Scrapy domain_name for spider

- by Zeynel

From the Scrapy tutorial: domain_name: identifies the Spider. It must be unique, that is, you can’t set the same domain name for different Spiders. Does this mean that domain_name must be a valid domain name, like domain_name = 'example.com' Or can I name domain_name = 'ex1' The problem is I had a spider that worked with domain name domain_name = 'whitecase.com' Now I created a new spider as an instance of CrawlSpider and named it domain_name = 'wc2' but I am getting the error "could not find spider for domain "wc2""

Read the article
Scrapy Not Returning Additional Info from Scraped Link in Item via Request Callback

- by zoonosis

Basically the code below scrapes the first 5 items of a table. One of the fields is another href, and clicking on that href provides more info which I want to collect and add to the original item. So parse is supposed to pass the semi-populated item to parse_next_page, which then scrapes the next bit and should return the completed item back to parse. Running the code below only returns the info collected in parse. If I change the return items to return request, I get a completed item with all 3 "things" but I only get 1 of the rows, not all 5. I'm sure it's something simple, I just can't see it. class ThingSpider(BaseSpider): name = "thing" allowed_domains = ["somepage.com"] start_urls = [ "http://www.somepage.com" ] def parse(self, response): hxs = HtmlXPathSelector(response) items = [] for x in range (1,6): item = ScrapyItem() str_selector = '//tr[@name="row{0}"]'.format(x) item['thing1'] = hxs.select(str_selector")]/a/text()').extract() item['thing2'] = hxs.select(str_selector")]/a/@href').extract() print 'hello' request = Request("www.nextpage.com", callback=self.parse_next_page,meta={'item':item}) print 'hello2' request.meta['item'] = item items.append(item) return items def parse_next_page(self, response): print 'stuff' hxs = HtmlXPathSelector(response) item = response.meta['item'] item['thing3'] = hxs.select('//div/ul/li[1]/span[2]/text()').extract() return item

Read the article
Scrapy - Follow RSS links

- by Tupak Goliam

Hello, I was wondering if anyone ever tried to extract/follow RSS links using SgmlLinkExtractor/CrawlSpider. I can't get it to work... I am using the following rule: rules = ( Rule(SgmlLinkExtractor(tags=('link',), attrs=False), follow=True, callback='parse_article'), ) (having in mind that rss links are located in the link tag). I am not sure how to tell SgmlLinkExtractor to extract the text() of the link and not to search the attributes ... Any help is welcome, Thanks in advance
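
For context: in RSS 2.0 the URL of an entry is the text content of its link element, whereas an HTML link carries the URL in an href attribute, which is what attribute-based link extractors look for. A minimal sketch of reading those URLs directly, using the Python 3 standard library rather than SgmlLinkExtractor; SAMPLE_RSS and item_links are hypothetical names and data.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document.
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Example feed</title>
<link>http://example.com/</link>
<item><title>One</title><link>http://example.com/one</link></item>
<item><title>Two</title><link> http://example.com/two </link></item>
</channel></rss>"""


def item_links(rss_text):
    """Return the text of each <item>'s <link> element, whitespace-stripped."""
    root = ET.fromstring(rss_text)
    return [
        item.findtext("link", default="").strip()
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for url in item_links(SAMPLE_RSS):
        print(url)
```

The extracted URLs could then be turned into requests by hand in the spider's callback.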

Read the article
scrapy cannot find div on this website [on hold]

- by Jaspal Singh Rathour

I am very new at this and have been trying to get my head around my first selector. Can somebody help? I am trying to extract data from the page http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false, specifically all the info under div class = listing clearfix shelfListing, but I can't seem to figure out how to format response.xpath(). I have managed to launch the Scrapy console, but no matter what I type in response.xpath() I can't seem to select the right node. I know it works because when I type response.xpath('//div[@class="container"]') I get a response, but I don't know how to navigate to the listing clearfix shelfListing div. I am hoping that once I get this bit I can continue working my way through the spider. Thank you in advance! PS I wonder if it is not possible to scan this site. Is it possible for the owners to block spiders?
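
A possible explanation, stated as an assumption rather than a diagnosis: everything after the # in that URL is a fragment, and HTTP clients do not send fragments to the server. If the shelf listing is filled in by JavaScript based on the fragment, the div would not be present in the HTML that Scrapy downloads, whatever XPath is used. The short Python 3 sketch below only shows how the URL splits; PAGE_URL is the address from the question.

```python
from urllib.parse import urldefrag

PAGE_URL = (
    "http://groceries.asda.com/asda-webstore/landing/home.shtml"
    "?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false"
)

# requested: the part a client actually asks the server for.
# fragment: the part that stays in the browser.
requested, fragment = urldefrag(PAGE_URL)

if __name__ == "__main__":
    print(requested)
    print(fragment)
```

When the div is present in the response, a class test that tolerates several class names is the usual XPath 1.0 idiom: //div[contains(concat(" ", normalize-space(@class), " "), " shelfListing ")].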

Read the article
How to reset Scrapy parameters? (always running under same parameters)

- by Jean Ventura

I've been running my Scrapy project with a couple of accounts (the project scrapes a specific site that requires login credentials), but no matter the parameters I set, it always runs with the same ones (same credentials). I'm running under virtualenv. Is there a variable or setting I'm missing? Edit: It seems that this problem is Twisted related. Even when I run: scrapy crawl -a user='user' -a password='pass' -o items.json -t json SpiderName I still get an error saying: ERROR: twisted.internet.error.ReactorNotRestartable And all the information I get is from the last 'successful' run of the spider.

Read the article
How to crawl a file hosting website with Scrapy in Python?

- by Veryel Hua

Can anyone help me figure out how to crawl a file hosting website like filefactory.com? I don't want to download all the hosted files, just index all available files with Scrapy. I have read the tutorial and the docs on the spider classes for Scrapy. If I only give the website's main page as the beginning URL, I would not crawl the whole site, because crawling depends on links and the beginning page does not seem to point to any file pages. That's the problem as I see it, and any help would be appreciated!

Read the article
Writing a program to scrape forums

- by seanieb

Hi, I need to write a program to scrape forums. Should I write the program in Python using the Scrapy framework or should I use PHP cURL? Also, is there a PHP equivalent to Scrapy? Thanks

Read the article
xpath: string manipulation

- by Jindan Zhou

So in my Scrapy project I was able to isolate some particular fields. One of the fields returns something like: [Rank Info] on 2013-06-27 14:26 Read 174 Times which was selected by the expression: (//td[@class="show_content"]/text())[4] I usually do post-processing to extract the datetime information, i.e., 2013-06-27 14:26 Now that I've learned a little more about XPath substring manipulation, I am wondering whether it is possible to extract that piece of information in the first place, i.e., in the XPath expression itself? Thanks,
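
XPath 1.0 does provide substring-after() and substring-before(), so an expression along the lines of substring-before(substring-after((//td[@class="show_content"]/text())[4], " on "), " Read") should isolate the timestamp, assuming the text really contains " on " and " Read" with exactly that spacing. The Python 3 sketch below mirrors the two functions for non-empty markers so the nesting can be checked offline; the helper names and RAW are illustrative only.

```python
from datetime import datetime


def substring_after(text, marker):
    """Like XPath substring-after(): text after the first marker, or ''."""
    _, sep, tail = text.partition(marker)
    return tail if sep else ""


def substring_before(text, marker):
    """Like XPath substring-before(): text before the first marker, or ''."""
    head, sep, _ = text.partition(marker)
    return head if sep else ""


# The sample value quoted in the question.
RAW = "[Rank Info] on 2013-06-27 14:26 Read 174 Times"

stamp = substring_before(substring_after(RAW, " on "), " Read")
parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M")

if __name__ == "__main__":
    print(stamp)
    print(parsed.isoformat())
```

Parsing the result with strptime is still post-processing, but only of an already isolated string.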

Read the article
Extra characters Extracted with XPath and Python (html)

- by Nacari

I have been using XPath with Scrapy to extract text from HTML tags online, but when I do I get extra characters attached. An example is trying to extract a number, like "204", from a <td> tag and getting [u'204']. In some cases it's much worse. For instance, trying to extract "1 - Mathoverflow" and instead getting [u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']. Is there a way to prevent this, or trim the strings so that the extra characters aren't a part of the string? (I'm using items to store the data.) It looks like it has something to do with formatting, so how do I get XPath to not pick up that stuff?
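
Part of what looks like extra characters is only representation: [u'204'] is how Python 2 prints a list holding one unicode string, and extract() returns a list. The surrounding \r\n\t whitespace is real, though, and comes from the page source. A minimal Python 3 sketch of trimming it after extraction follows; clean and first_clean are hypothetical helper names. Wrapping the XPath in normalize-space() is another option when a single string is wanted.

```python
def clean(values):
    """Strip surrounding whitespace from each string and drop empty results."""
    return [value.strip() for value in values if value.strip()]


def first_clean(values, default=None):
    """Return the first non-empty cleaned string, or default if there is none."""
    cleaned = clean(values)
    return cleaned[0] if cleaned else default


if __name__ == "__main__":
    # The second sample value is the one quoted in the question.
    print(first_clean(["204"]))
    print(first_clean(["\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t "]))
```

Storing first_clean(...) in the item instead of the raw list would keep both the brackets and the whitespace out of the output.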

Read the article
Scraping a page from a secure URL which is possibly using a session ID

- by VN44CA

How to scrape a page like this. https://www.procom.ca/JobList.aspx?keywords=&Cities=&reference=&JobType=0 It is secure, and requires a referrer? I can't get anything using wget or httplib2. If you go through this page, you get a list and it works on a browser but not the command line. https://www.procom.ca/jobsearch.aspx I am interested in command line fetching. thx
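
One approach worth trying, on the assumption (not confirmed by the post) that the site checks for a session cookie and a Referer header: visit the search page first with a cookie-aware opener, then request the list page with the search page as referrer. The sketch uses only the Python 3 standard library; build_session, make_request and fetch_job_list are hypothetical helpers, and the User-Agent value is a placeholder.

```python
import http.cookiejar
import urllib.request

SEARCH_URL = "https://www.procom.ca/jobsearch.aspx"
LIST_URL = "https://www.procom.ca/JobList.aspx?keywords=&Cities=&reference=&JobType=0"


def build_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def make_request(url, referer=None, user_agent="Mozilla/5.0"):
    """Build a request with a browser-like User-Agent and optional Referer."""
    headers = {"User-Agent": user_agent}
    if referer is not None:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


def fetch_job_list():
    """Visit the search page to pick up cookies, then fetch the job list."""
    opener, _ = build_session()
    opener.open(make_request(SEARCH_URL)).close()
    with opener.open(make_request(LIST_URL, referer=SEARCH_URL)) as response:
        return response.read()


if __name__ == "__main__":
    print(len(fetch_job_list()))
```

If the page still comes back empty, the list may depend on ASP.NET form state posted from the search page, which this sketch does not handle.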

Read the article
What are the default/best bluetooth packages for Ubuntu 9.10?

- by Igoru

I've had some problems with Bluetooth in my Ubuntu 9.10, and ended up uninstalling everything related. I would like to know what the default packages for Bluetooth in Ubuntu are, and what your recommendations are (like blueman or things like this). I have a high-end cellphone and would like to use everything that's possible over Bluetooth with it (a Samsung Scrapy Touch, or GT-B3410).

Read the article
Python version 2.6 required, which was not found in the registry

- by user74283

I can't download and install any Python modules for Windows. I wanted to experiment with the Scrapy framework and Stackless but am unable to install them due to the error "Python version 2.6 required, which was not found in the registry". I am trying to install on a Windows 7, 64-bit machine.

Read the article
Scraping HTML WITHOUT unique identifiers using Python

- by Nicholas Law

I would like to design an algorithm using python that scrapes thousands of pages like this one and this one, gathers all the data and inserts it into a MySQL database. The script will be run on a weekly or bi-weekly basis to update the database of any new information added to each individual page. Ideally I would like a scraper that is easy to work with for table structured data but also data that does not have unique identifiers (ie. id and classes attributes). Which scraper add-on should I use? BeautifulSoup, Scrapy or Mechanize? Are there any particular tutorials/books I should be looking at for this desired result? In the long-run I will be implementing a mobile app that works with all this data through querying the database.

Read the article
