urllib2 misbehaving with dynamically loaded content

Posted by Sheena on Stack Overflow
Published on 2012-11-27T09:00:37Z

Some Code

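import urllib.request
from urllib import error

# sURL and sFileName are defined elsewhere in the script.
headers = {}
# NOTE: this value repeats the header name, so the server receives
# "User-Agent: User-Agent: Mozilla/5.0 ..." (see the headers my script sends, further down).
headers['user-agent'] = 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0'
headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
headers['Accept-Language'] = 'en-gb,en;q=0.5'
#headers['Accept-Encoding'] = 'gzip, deflate'

request = urllib.request.Request(sURL, headers=headers)
try:
    response = urllib.request.urlopen(request)
except error.HTTPError as e:
    # The server answered, but with an error status.
    print("The server couldn't fulfill the request.")
    print('Error code: {0}'.format(e.code))
except error.URLError as e:
    # No usable answer at all (DNS failure, refused connection, ...).
    print('We failed to reach a server.')
    print('Reason: {0}'.format(e.reason))
else:
    # Save the decoded body; the with-block closes the file afterwards.
    with open('output/{0}.html'.format(sFileName), 'w') as f:
        f.write(response.read().decode('utf-8'))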

A URL

http://groupon.cl/descuentos/santiago-centro

The situation

Here's what I did:

  1. Enable JavaScript in the browser.
  2. Open the URL above and keep an eye on the console.
  3. Disable JavaScript.
  4. Repeat step 2.
  5. Use urllib2 (urllib.request in the code above) to grab the webpage and save it to a file.
  6. Enable JavaScript.
  7. Open the file with the browser and observe the console.
  8. Repeat step 7 with JavaScript off.

Results

  • In step 2 I saw that a lot of the page content was loaded dynamically using AJAX. The HTML that arrived was a sort of skeleton, and AJAX was used to fill in the gaps. This is fine and not at all surprising.

  • Since the page should be SEO friendly, it should work fine without JavaScript. In step 4 nothing happens in the console and the page loads pre-populated, which makes the AJAX unnecessary. This is also not confusing at all.

  • In step 7 the AJAX calls are made but fail. This is also OK: the URLs they use are not local, so the calls are broken. The page looks like the skeleton. This is also expected.

  • In step 8 no AJAX calls are made and the skeleton is just a skeleton. I would have thought that this should behave very much like step 4.
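
To illustrate why the calls in step 7 break, here is a minimal sketch of how the same relative URL resolves against the live page and against a saved copy. It assumes the AJAX URLs are relative to the site root; the path `/ajax/deals` and the local file location are hypothetical and not taken from the actual page.

```python
from urllib.parse import urljoin

live_page = 'http://groupon.cl/descuentos/santiago-centro'
saved_page = 'file:///home/user/output/page.html'  # hypothetical local copy
ajax_path = '/ajax/deals'                          # hypothetical AJAX endpoint

# The browser resolves a relative URL against the document it was loaded from.
from_live = urljoin(live_page, ajax_path)
from_file = urljoin(saved_page, ajax_path)

print(from_live)  # resolves against the groupon.cl host
print(from_file)  # resolves against the local filesystem instead
```

If this assumption holds, the saved copy asks the local filesystem for the data, which would explain the failed calls.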

Question

What I want to do is use urllib2 to grab the HTML from step 4, but I can't figure out how. What am I missing, and how could I pull this off?

To paraphrase

If I were writing a spider, I would want to be able to grab plain ol' HTML (as in that which resulted in step 4). I don't want to execute AJAX stuff or any JavaScript at all. I don't want to populate anything dynamically. I just want HTML.

The SEO-friendly site wants me to get what I want, because that's what SEO is all about.

How would one go about getting plain HTML content given the situation I outlined? To do it manually I would turn off JavaScript, navigate to the page and copy the HTML. I want to automate this.

Stuff I've tried

I used Wireshark to look at packet headers, and the GETs sent from my PC in steps 2 and 4 have the same headers. Reading about SEO makes me think that this is pretty normal; otherwise techniques such as Hijax wouldn't be used.

Here are the headers my browser sends:

Host: groupon.cl
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive

Here are the headers my script sends:

Accept-Encoding: identity
Host: groupon.cl
Accept-Language: en-gb,en;q=0.5
Connection: close
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0
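
The headers set explicitly on the request can also be checked from Python without a packet capture. This is a minimal sketch using `Request.header_items()`. It only lists what is stored on the `Request` object, so headers the library adds at send time (the `Accept-Encoding: identity` and `Connection: close` in the capture above) do not show up. Nothing is sent over the network here.

```python
import urllib.request

# Same header dict as in the script at the top of the question.
headers = {}
headers['user-agent'] = 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0'
headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
headers['Accept-Language'] = 'en-gb,en;q=0.5'

request = urllib.request.Request('http://groupon.cl/descuentos/santiago-centro', headers=headers)

# header_items() returns (name, value) pairs stored on the Request.
# urllib normalises the names with str.capitalize(), e.g. 'User-agent'.
explicit_headers = dict(request.header_items())
for name, value in sorted(explicit_headers.items()):
    print('{0}: {1}'.format(name, value))
```

The User-agent value printed here starts with a second "User-Agent: ", which matches the doubled prefix in the captured script headers and is one more difference from the browser request.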

The differences are:

  • My script sends Connection: close instead of keep-alive. I can't see how this would cause a problem.
  • My script sends Accept-Encoding: identity. This might be the cause of the problem, although I can't really see why the host would use this field to determine the user agent. If I change the encoding to match the browser request headers, then I have trouble decoding the response. I'm working on this now...
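
On the decoding trouble: once `Accept-Encoding: gzip, deflate` is sent, the body may arrive compressed and has to be decompressed before `.decode('utf-8')`. Below is a minimal sketch, assuming the server labels the body with a `Content-Encoding` header. The helper name `decode_body` is made up for illustration, and whether this changes what the site returns is untested.

```python
import gzip
import zlib


def decode_body(raw, content_encoding, charset='utf-8'):
    """Undo the Content-Encoding of a response body and return it as text."""
    encoding = (content_encoding or 'identity').lower()
    if encoding == 'gzip':
        raw = gzip.decompress(raw)
    elif encoding == 'deflate':
        try:
            raw = zlib.decompress(raw)                   # zlib-wrapped deflate
        except zlib.error:
            raw = zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    return raw.decode(charset)


# Offline demonstration with a locally compressed body (no network needed).
sample = '<html><body>hola</body></html>'
compressed = gzip.compress(sample.encode('utf-8'))
print(decode_body(compressed, 'gzip') == sample)
```

With the script above, the call would look like `decode_body(response.read(), response.headers.get('Content-Encoding'))`.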

Watch this space; I'll update the question as new info comes up.
