I'm pulling some RSS feeds into the App Engine datastore to serve to an iPhone app. I use cron to schedule an update of the RSS every x minutes, and each task parses just one RSS feed (which has 15-20 items). I frequently get warnings about high CPU usage in the App Engine dashboard, so I'm looking for ways to optimise my code.
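Roughly, the cron config looks like this (the URL and the interval here are placeholders rather than my real values):

    cron:
    - description: update one RSS feed
      url: /tasks/update_feed
      schedule: every 10 minutes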
Currently, I use minidom (since it's already there on App Engine), but I suspect it's not very efficient!
Here's the code:
    from datetime import datetime
    from xml.dom import minidom

    from google.appengine.api import urlfetch
    from google.appengine.ext import db

    # This runs inside a request handler class, hence the self.getText() calls.
    dom = minidom.parseString(urlfetch.fetch(url).content)
    if dom:
        items = []
        for node in dom.getElementsByTagName('item'):
            item = RssItem(
                key_name=self.getText(node.getElementsByTagName('guid')[0].childNodes),
                title=self.getText(node.getElementsByTagName('title')[0].childNodes),
                description=self.getText(node.getElementsByTagName('description')[0].childNodes),
                modified=datetime.now(),
                link=self.getText(node.getElementsByTagName('link')[0].childNodes),
                categories=[self.getText(category.childNodes) for category in node.getElementsByTagName('category')]
            )
            items.append(item)
        db.put(items)  # one batch put for the whole feed

    def getText(self, nodelist):
        # Helper on the same class: concatenate the text child nodes of an element.
        rc = ''
        for node in nodelist:
            if node.nodeType == node.TEXT_NODE:
                rc = rc + node.data
        return rc
There isn't much going on, but the scripts often take 2-6 seconds of CPU time, which seems excessive for looping through roughly 20 items and reading a few attributes.
What can I do to make this faster? Is there anything particularly bad in the above code, or should I switch to another way of parsing? Are there any libraries (that work on App Engine) that would be better, or would I be better off parsing the RSS myself?
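For instance, I was wondering whether switching to cElementTree (which I believe is included in the Python runtime on App Engine, so no extra dependency) would help, since it's implemented in C rather than pure Python. Below is a rough, untested sketch of what I imagine the equivalent parsing would look like; the parse_feed name and the 'channel/item' path are just my assumptions about the feed structure:

    from datetime import datetime
    from xml.etree import cElementTree as ET

    from google.appengine.api import urlfetch
    from google.appengine.ext import db

    def parse_feed(url):
        # Parse the fetched XML once with cElementTree instead of minidom.
        root = ET.fromstring(urlfetch.fetch(url).content)
        items = []
        for node in root.findall('channel/item'):
            # findtext() returns the element's text (or the default),
            # so no getText() helper is needed.
            items.append(RssItem(
                key_name=node.findtext('guid', ''),
                title=node.findtext('title', ''),
                description=node.findtext('description', ''),
                modified=datetime.now(),
                link=node.findtext('link', ''),
                categories=[c.text or '' for c in node.findall('category')]
            ))
        db.put(items)  # still one batch put per feed

Would something along those lines make a noticeable difference, or is the parser unlikely to be the main cost here?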