App-Engine Parse a UrlFetch UTF-8 encoded stream
- by Davidrd91
I am trying to parse an XML document from a URL using the xml.sax parser. I know there are other libraries, but coming from Java this is the one I am most familiar with, and it seems the least complicated to me.
The code I'm using to parse it is as follows:
import xml.sax
from google.appengine.api import urlfetch

# MangaHandler is my own content handler class (not shown)
parser = xml.sax.make_parser()
handler = MangaHandler()
parser.setContentHandler(handler)
url = urlfetch.Fetch('http://www.mangapanda.com/alphabetical', allow_truncated=False, follow_redirects=False, deadline=False)
xml.sax.parseString(url.content, handler)
This raises a SAXParseException (invalid token) as soon as the parser reaches the first & character:
SAXParseException: <unknown>:582:34: not well-formed (invalid token)
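The same kind of failure can be reproduced without any network access. The following is a minimal sketch, assuming the problem is a bare & that does not begin an entity reference (which is not well-formed XML); the two sample documents and the try_parse helper are hypothetical and are not taken from the real page.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical sample documents, not the real page content.
bad = b"<list><item>Tom & Jerry</item></list>"       # bare ampersand
good = b"<list><item>Tom &amp; Jerry</item></list>"  # escaped ampersand


def try_parse(data):
    """Return None if data parses, otherwise the parser's error message."""
    try:
        xml.sax.parseString(data, ContentHandler())
        return None
    except xml.sax.SAXParseException as exc:
        return exc.getMessage()


bad_error = try_parse(bad)
good_error = try_parse(good)
print(bad_error)
print(good_error)
```

If the bare-ampersand document fails while the escaped one parses, that would point at the markup itself rather than at the UTF-8 encoding.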
Because urlfetch returns a string and not a stream, I cannot use parse(), which expects a stream, and have to use parseString() instead. To see whether parsing the content as a stream would fix this, I tried:
parser.parse(io.StringIO(url.content).encode('utf-8'))
but this raises:
TypeError: initial_value must be unicode or None, not str
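This TypeError is consistent with io.StringIO accepting only unicode text, while the fetched content is a byte string. A minimal sketch of wrapping bytes in a stream with io.BytesIO instead, assuming a hypothetical UTF-8 payload in place of url.content and a hypothetical CountingHandler in place of MangaHandler:

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class CountingHandler(ContentHandler):
    """Hypothetical stand-in for MangaHandler: counts start tags."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


# Hypothetical UTF-8 encoded payload standing in for url.content.
content = u"<list><item>caf\u00e9</item></list>".encode("utf-8")

handler = CountingHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# BytesIO wraps a byte string as a file-like object that parse() can read.
parser.parse(io.BytesIO(content))
print(handler.count)
```

This only covers turning a byte string into a stream; it would not by itself make markup that is not well-formed parse.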
I have also tried urllib2 instead of urlfetch, since it does return a stream, but the response is too large and gets truncated automatically, leaving me with missing data.
Any sort of workaround for this would be greatly appreciated, as I've spent days getting around one obstacle only to be stopped by another.