Scrapy Not Returning Additional Info from Scraped Link in Item via Request Callback
- by zoonosis
Basically, the code below scrapes the first 5 rows of a table. One of the fields is another href, and following that href leads to extra info which I want to collect and add to the original item. So parse is supposed to pass the semi-populated item to parse_next_page, which then scrapes the extra bit and should return the completed item back to parse.

Running the code below only returns the info collected in parse.

If I change the return items to return request, I get a completed item with all 3 "things", but I only get 1 of the rows, not all 5. (A sketch of the per-row flow I'm after is at the end of the post.)

I'm sure it's something simple, I just can't see it.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
# ScrapyItem is the project's item class, defined in items.py

class ThingSpider(BaseSpider):
    name = "thing"
    allowed_domains = ["somepage.com"]
    start_urls = [
        "http://www.somepage.com"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        for x in range(1, 6):
            item = ScrapyItem()
            str_selector = '//tr[@name="row{0}"]'.format(x)
            item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
            item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
            print 'hello'
            request = Request("http://www.nextpage.com",
                              callback=self.parse_next_page,
                              meta={'item': item})
            print 'hello2'
            request.meta['item'] = item
            items.append(item)
        return items

    def parse_next_page(self, response):
        print 'stuff'
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['thing3'] = hxs.select('//div/ul/li[1]/span[2]/text()').extract()
        return item
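For reference, this is roughly the per-row flow I'm trying to get. It is only a sketch: the detail URL below is a placeholder, and in the real spider it would come from the href scraped into thing2. Each row yields its own Request carrying the half-built item in meta, and parse_next_page fills in thing3 before the item comes out:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for x in range(1, 6):
            item = ScrapyItem()
            str_selector = '//tr[@name="row{0}"]'.format(x)
            item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
            item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
            # hand the half-built item to the detail-page callback;
            # the URL here is a placeholder for the scraped href
            yield Request("http://www.nextpage.com",
                          callback=self.parse_next_page,
                          meta={'item': item})

    def parse_next_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['thing3'] = hxs.select('//div/ul/li[1]/span[2]/text()').extract()
        return item

Is that roughly the right shape, or am I misunderstanding how the requests get scheduled per row?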