Cleaning an XML file in Python before parsing
Posted by Sam on Stack Overflow
Published on 2010-03-30T14:02:59Z
I'm using minidom to parse an XML file, and it throws an error saying the data is not well-formed. I figured out that some of the pages contain characters like ไอเฟล &, which cause the parser to hiccup. Is there an easy way to clean the file before I start parsing it? Right now I'm using a regular expression to throw away anything that isn't an alphanumeric character or one of the <, / and > characters, but it isn't quite working.
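For illustration, here is a minimal sketch of a gentler cleanup than stripping everything non-alphanumeric (which also removes the spaces, quotes and = signs that attributes need). It assumes the file is really UTF-8 and that the errors come from bare & characters and stray control characters; neither assumption is confirmed by the post. The names clean_xml and parse_cleaned are hypothetical helpers, not part of minidom.

```python
import re
from xml.dom import minidom

# An "&" that does not start a named or numeric entity reference.
_BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")
# Control characters that XML 1.0 forbids (tab, LF and CR are allowed).
_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_xml(raw, encoding="utf-8"):
    """Hypothetical helper: decode bytes and repair common well-formedness problems."""
    # Assumption: the data is UTF-8; undecodable bytes become U+FFFD
    # instead of raising an exception.
    text = raw.decode(encoding, errors="replace")
    text = _ILLEGAL.sub("", text)
    # Escape bare ampersands, leaving existing references such as &amp; alone.
    return _BARE_AMP.sub("&amp;", text)


def parse_cleaned(raw, encoding="utf-8"):
    """Hypothetical helper: clean the raw bytes, then hand the text to minidom."""
    return minidom.parseString(clean_xml(raw, encoding))


if __name__ == "__main__":
    # Made-up sample containing non-ASCII text and a bare ampersand.
    sample = "<page><title>ไอเฟล & more</title></page>".encode("utf-8")
    doc = parse_cleaned(sample)
    print(doc.documentElement.toxml())
```

Non-ASCII text such as Thai is legal XML when the encoding is right, so this sketch keeps it rather than deleting it. It does not handle undefined named entities (for example &nbsp; without a DTD) or mismatched tags; if the input is really HTML, an HTML parser is probably a better fit than cleaning by regex.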
© Stack Overflow or respective owner