Python.expat can't parse XML file with bad symbols. How to go around?

Posted by culebrón on Stack Overflow See other posts from Stack Overflow or by culebrón
Published on 2010-03-22T20:37:12Z Indexed on 2010/03/23 1:21 UTC
Read the original article Hit count: 375

Filed under:
|
|
|

I'm trying to parse an XML file with expat, and here's the line where I get bad token exception:

<tag k="name"
v="???????????????????????????????????????????????????????????????????" />

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 610127, column 37

The symbols in hex look like: \xd1? Seems like someone wrote this string (Russian alfabet) hitting backspace a few times.

I set parser.returns_unicode = True, but this didn't help. The 1st line is <?xml version="1.0" encoding="UTF-8"?>. I work with a bz2 file. (bz2.BZ2File)

How can I parse the file?

© Stack Overflow or respective owner

Related posts about python

Related posts about expat