Python.expat can't parse XML file with bad symbols. How to go around?
Posted
by culebrón
on Stack Overflow
See other posts from Stack Overflow
or by culebrón
Published on 2010-03-22T20:37:12Z
Indexed on
2010/03/23
1:21 UTC
Read the original article
Hit count: 381
I'm trying to parse an XML file with expat, and here's the line where I get bad token exception:
<tag k="name"
v="???????????????????????????????????????????????????????????????????" />
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 610127, column 37
The symbols in hex look like: \xd1?
Seems like someone wrote this string (Russian alfabet) hitting backspace a few times.
I set parser.returns_unicode = True
, but this didn't help. The 1st line is <?xml version="1.0" encoding="UTF-8"?>
. I work with a bz2 file. (bz2.BZ2File)
How can I parse the file?
© Stack Overflow or respective owner