Scrapy Not Returning Additional Info from Scraped Link in Item via Request Callback
Posted by zoonosis on Stack Overflow. Published on 2012-09-05T07:39:13Z; indexed on 2012/09/06 15:38 UTC. Hit count: 296
Basically the code below scrapes the first 5 rows of a table. One of the fields is another href, and following that href provides more info, which I want to collect and add to the original item. So parse is supposed to pass the semi-populated item to parse_next_page, which then scrapes the next bit and should return the completed item back to parse.

Running the code below only returns the info collected in parse. If I change return items to return request, I get a completed item with all 3 "things", but I only get 1 of the rows, not all 5.

I'm sure it's something simple, I just can't see it.
class ThingSpider(BaseSpider):
    name = "thing"
    allowed_domains = ["somepage.com"]
    start_urls = ["http://www.somepage.com"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        for x in range(1, 6):
            item = ScrapyItem()
            str_selector = '//tr[@name="row{0}"]'.format(x)
            item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
            item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
            print 'hello'
            request = Request("www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
            print 'hello2'
            request.meta['item'] = item
            items.append(item)
        return items

    def parse_next_page(self, response):
        print 'stuff'
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['thing3'] = hxs.select('//div/ul/li[1]/span[2]/text()').extract()
        return item
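A minimal plain-Python sketch (no Scrapy needed, names are illustrative only) of the control-flow difference the question describes: a bare `return` inside the loop exits `parse` on the first iteration, which is why switching `return items` to `return request` produces only one completed row, whereas a generator that `yield`s a request per row hands all five back to the caller (in Scrapy, the engine).

```python
def parse_with_return(rows):
    # stand-in for the asker's loop with `return request` inside it
    for row in rows:
        request = ('Request', row)  # placeholder for scrapy.Request
        return request  # exits immediately: only the first row is followed

def parse_with_yield(rows):
    # generator variant: one request per row reaches the caller
    for row in rows:
        yield ('Request', row)

rows = ['row1', 'row2', 'row3', 'row4', 'row5']
print(parse_with_return(rows))       # ('Request', 'row1') -- just one row
print(len(list(parse_with_yield(rows))))  # 5 -- all rows produced
```

This is only a sketch of the generator semantics involved, not a drop-in fix for the spider above; in a real spider each yielded request would carry its semi-populated item in `meta`, and `parse_next_page` would return the completed item.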
© Stack Overflow or respective owner