App-Engine Parse a UrlFetch UTF-8 encoded stream
Posted by Davidrd91 on Stack Overflow
Published on 2012-11-25T22:53:59Z
Indexed on 2012/11/25 23:03 UTC
I am trying to parse an XML document from a URL using the xml.sax parser. I know there are other libraries, but coming from Java this is the one I am most familiar with, and it seems the least complicated to me.
The code I'm using to parse is as follows:
parser = xml.sax.make_parser()
handler = MangaHandler()
parser.setContentHandler(handler)
url = urlfetch.Fetch('http://www.mangapanda.com/alphabetical', allow_truncated = False, follow_redirects = False, deadline = False)
xml.sax.parseString(url.content, handler)
This raises a SAXParseException (invalid token) once the parser reaches the first & sign:
SAXParseException: <unknown>:582:34: not well-formed (invalid token)
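For illustration, the same error can be reproduced offline. This is a minimal sketch, assuming the token at line 582 is a bare & that is not written as &amp; (the fetched page itself is not shown here). TextCollector is a hypothetical stand-in for MangaHandler, and the two sample documents are made up.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical stand-in for MangaHandler: collects character data."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        # characters() may be called several times per text node.
        self.chunks.append(content)


def try_parse(data):
    """Return the collected text, or the parser's message on failure."""
    handler = TextCollector()
    try:
        xml.sax.parseString(data, handler)
    except xml.sax.SAXParseException as exc:
        return "error: " + exc.getMessage()
    return "".join(handler.chunks)


# Made-up sample documents: the first has a bare ampersand, the second escapes it.
bad = b"<list><item>Tom & Jerry</item></list>"
good = b"<list><item>Tom &amp; Jerry</item></list>"

print(try_parse(bad))
print(try_parse(good))
```

If the assumption holds, the input is simply not well-formed XML at that point, so the choice between parse() and parseString() would not be expected to change the result.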
Because urlfetch returns a string and not a stream, I cannot use parse() (which only works with streams) and am left with parseString() instead. To see if parsing as a stream would fix this, I tried:
parser.parse(io.StringIO(url.content).encode('utf-8'))
but this returns:
TypeError: initial_value must be unicode or None, not str
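The TypeError comes from io.StringIO, which accepts only unicode text, while url.content is a byte string; the .encode('utf-8') call is also applied to the StringIO object rather than to the data. A byte string can instead be wrapped in io.BytesIO to get a file-like object for parse(). The sketch below assumes a small made-up UTF-8 document in place of url.content, and ElementCounter is a hypothetical handler used only for the demonstration.

```python
import io
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts start tags."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


# Stand-in for url.content: UTF-8 encoded bytes, not unicode text.
content = u"<list><item>caf\u00e9</item></list>".encode("utf-8")

parser = xml.sax.make_parser()
handler = ElementCounter()
parser.setContentHandler(handler)

# BytesIO wraps the bytes in a file-like object that parse() can read from.
parser.parse(io.BytesIO(content))
print(handler.count)
```

This only changes how the bytes are handed to the parser. It would not by itself make a document with an invalid token parse.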
I have also tried the urllib2 library, which does return a stream, instead of urlfetch, but the file is too large and is automatically truncated, leaving me with missing data.
Any sort of workaround for this would be greatly appreciated, as I've spent days getting around one obstacle just to be stopped by another.
© Stack Overflow or respective owner