Scrapy issue with iTunes' AppStore

Posted by Eric on Stack Overflow See other posts from Stack Overflow or by Eric
Published on 2010-04-10T22:54:37Z Indexed on 2010/04/10 23:03 UTC
Read the original article Hit count: 699

Filed under:

appstore

I am using Scrapy to fetch some data from iTunes' AppStore database. I start with this list of apps: http://itunes.apple.com/us/genre/mobile-software-applications/id36?mt=8

In the following code I have used the simplest regex which targets all apps in the US store.

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class AppStoreSpider(CrawlSpider):
    domain_name = 'itunes.apple.com'
    start_urls = ['http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8']

    rules = (
        Rule(SgmlLinkExtractor(allow='itunes\.apple\.com/us/app'),
            'parse_app', follow=True,
        ),
    )

def parse_app(self, response):
    ....

SPIDER = AppStoreSpider()

When I run it I receive the following:

 [itunes.apple.com] DEBUG: Crawled (200) <GET http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8> (referer: None)
 [itunes.apple.com] DEBUG: Filtered offsite request to 'itunes.apple.com': <GET http://itunes.apple.com/us/app/bloomberg/id281941097?mt=8>

As you can see, when it starts crawling the first page it says: "Filtered offsite request to 'itunes.apple.com'". and then the spider stops.. it also returns this message:

[ScrapyHTTPPageGetter,client] /usr/lib/python2.5/cookielib.py:1577: exceptions.UserWarning: cookielib bug!

Traceback (most recent call last): File "/usr/lib/python2.5/cookielib.py", line 1575, in make_cookies parse_ns_headers(ns_hdrs), request) File "/usr/lib/python2.5/cookielib.py", line 1532, in _cookies_from_attrs_set cookie = self._cookie_from_cookie_tuple(tup, request) File "/usr/lib/python2.5/cookielib.py", line 1451, in _cookie_from_cookie_tuple if version is not None: version = int(version) ValueError: invalid literal for int() with base 10: '"1"'

I have used the same script for other website and I didn't have this problem.

Any suggestion?

Developer IT