Python minidom and UTF-8 encoded XML with hash references
- by Jakob Simon-Gaarde
Hi
I am experiencing some difficulty in my home project, where I need to parse a SOAP request. The SOAP is generated with gSOAP and involves string parameters with special characters like the Danish letters "æøå".
gSOAP builds SOAP requests with UTF-8 encoding by default, but instead of sending the special characters in raw form (i.e. the bytes C3 A6 for the special character "æ"), it sends what I think are called character hash references (i.e. &#195;&#166;).
I don't completely understand why gSOAP does it this way, as I can see that it has marked the incoming payload as UTF-8 encoded anyway (Content-Type: text/xml; charset=utf-8), but this is beside the point (I think).
Anyway, I guess gSOAP is probably obeying some transport rules, or what?
When I parse the request from gSOAP in Python with xml.dom.minidom.parseString(), I get element values as unicode objects, which is fine, but the character hash references are not decoded as UTF-8. The parser unescapes each character hash reference into a character of its own, but does not decode the resulting byte values as UTF-8 afterwards. In the end I have a unicode object whose characters are really the bytes of a UTF-8 encoded string.
So if the string "æble" is contained in the XML, it comes like this in the request:
"&#195;&#166;ble"
After parsing the XML the unicode string in the DOM Text Node's data member looks like this:
u'\xc3\xa6ble'
I would expect it to look like this:
u'\xe6ble'
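For reference, here is a minimal sketch that should reproduce what I am seeing. The <name> element and the decimal form of the references are assumptions made for illustration; the real SOAP envelope is of course larger.

```python
from xml.dom import minidom

# Assumed request fragment: "æble" sent as two numeric references,
# one per UTF-8 byte (0xC3 = 195, 0xA6 = 166). The element name is made up.
xml = b'<?xml version="1.0" encoding="UTF-8"?><name>&#195;&#166;ble</name>'

doc = minidom.parseString(xml)
data = doc.documentElement.firstChild.data  # data member of the Text node

# Each reference has become one character of its own.
code_points = [hex(ord(c)) for c in data]
print(code_points)  # ['0xc3', '0xa6', '0x62', '0x6c', '0x65']
```

So the parser treats 195 and 166 as two separate code points rather than as the two bytes of one UTF-8 sequence.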
What am I doing wrong? Should I unescape the SOAP XML before parsing it, or is it somewhere else I should be looking for the solution, maybe gSOAP?
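The only workaround I can think of on the Python side is to repair the string after parsing. This is just a sketch under the assumption that every affected character has a code point below 256 and really stands for one UTF-8 byte; fix_double_encoded is a hypothetical helper name, not something from minidom.

```python
def fix_double_encoded(text):
    """Reinterpret characters 0-255 as UTF-8 bytes (hypothetical helper)."""
    try:
        # latin-1 maps code points 0-255 one-to-one onto byte values,
        # so this recovers the original bytes before decoding them.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not a byte-per-character string, or not valid UTF-8: leave it alone.
        return text

parsed = u'\xc3\xa6ble'              # what the DOM gives me
fixed = fix_double_encoded(parsed)   # what I would expect instead
```

This feels like a hack, though, because it would also silently change any string that legitimately contains characters which happen to form a valid UTF-8 byte sequence.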
Thanks in advance.
Best regards Jakob Simon-Gaarde