How to get the entire document in Scrapy using hxs.select

Posted by Chris Smith on Stack Overflow
Published on 2012-11-17T22:55:58Z Indexed on 2012/11/17 22:59 UTC

Filed under: xpath | scrapy

I've been at this for 12 hours and I'm hoping someone can give me a leg up.

Here is my code. All I want is to get the anchor text and URL of every link on each page as the spider crawls along.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.url import urljoin_rfc
from scrapy.utils.response import get_base_url
from urlparse import urljoin

#from scrapy.item import Item
from tutorial.items import DmozItem

class HopitaloneSpider(CrawlSpider):
    name = 'dmoz'
    allowed_domains = ['domain.co.uk']
    start_urls = [
        'http://www.domain.co.uk'
    ]

    rules = (
        #Rule(SgmlLinkExtractor(allow='>example\.org', )),
        Rule(SgmlLinkExtractor(allow=('\w+$', )), callback='parse_item', follow=True),
    )

    user_agent = 'Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))'

    def parse_item(self, response):
        #self.log('Hi, this is an item page! %s' % response.url)

        hxs = HtmlXPathSelector(response)
        #print response.url
        sites = hxs.select('//html')
        #item = DmozItem()
        items = []

        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            items.append(item)

        return items

What am I doing wrong? My eyes hurt now.
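
One possible cause, sketched under assumptions rather than confirmed against the actual site: the spider selects '//html' and then asks for 'a/text()' and 'a/@href' relative to that node. A relative step such as 'a' matches only direct children of the context node, and links normally sit deeper, inside the body. The sketch below illustrates that difference using the standard library's ElementTree (which supports only a subset of XPath) instead of Scrapy, so it runs without a crawl. PAGE, direct_children and all_links are hypothetical names, and the markup is an invented stand-in for a real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a crawled page (not from the original post).
PAGE = """<html>
  <body>
    <p><a href="/about">About us</a></p>
    <div><a href="http://www.domain.co.uk/contact">Contact</a></div>
  </body>
</html>"""

# The context node is <html>, comparable to the result of hxs.select('//html').
root = ET.fromstring(PAGE)

# 'a' is relative to the context node and matches only its direct children.
direct_children = root.findall('a')

# './/a' matches <a> elements at any depth below the context node.
all_links = [(a.text, a.get('href')) for a in root.findall('.//a')]

print(len(direct_children))  # 0: <html> has no <a> as a direct child
print(all_links)
```

If the same reasoning applies to the spider, the analogous change would be to loop over hxs.select('//a') and read site.select('text()') and site.select('@href') from each link, which would yield one item per anchor instead of one per page.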

© Stack Overflow or respective owner
