Should I strip the XML declaration from suds output before parsing with lxml?
- by mikl
I’m trying to implement a SOAP webservice in Python 2.6 using the suds library. That is working well, but I’ve run into a problem when trying to parse the output with lxml.
Suds returns a suds.sax.text.Text object with the reply from the SOAP service. The suds.sax.text.Text class is a subclass of Python's built-in unicode type. In essence, the reply is comparable to this Python expression:
u'<?xml version="1.0" encoding="utf-8" ?><root><lotsofelements /></root>'
This is incongruous: if the XML declaration is correct, the contents are UTF-8 encoded and therefore not a Python unicode object (those are stored in an internal encoding such as UCS-4).
lxml will refuse to parse this, as documented, since there is no clear answer to what encoding it should be interpreted as.
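A minimal sketch of that refusal, assuming lxml is installed. The `reply` string and the `try_parse` helper are made up for illustration and only stand in for the real suds output:

```python
from lxml import etree  # third-party package, assumed to be installed

# Hypothetical stand-in for the suds reply: text (not bytes) that still
# carries an encoding declaration.
reply = u'<?xml version="1.0" encoding="utf-8" ?><root><child /></root>'


def try_parse(text):
    """Return the parsed root element, or the ValueError that lxml raises."""
    try:
        return etree.fromstring(text)
    except ValueError as exc:
        return exc


result = try_parse(reply)           # a ValueError instance, not an element
fragment = try_parse(u'<root />')   # text without a declaration parses fine
```

The same text without the declaration is accepted, which suggests the declaration is the only thing lxml objects to here.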
As I see it, there are two ways out of this bind:
1. Strip the <?xml> declaration, including the encoding.
2. Convert the output from Suds into a bytestring, using the specified encoding.
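Both options can be sketched with the standard library alone. The helper names, the regular expressions, and the sample `reply` are assumptions for illustration, not part of suds or lxml, and the regex handling is deliberately simple rather than a full XML declaration parser:

```python
import re

# Matches a leading XML declaration such as <?xml version="1.0" encoding="utf-8" ?>
XML_DECL = re.compile(r'^\s*<\?xml\s[^>]*\?>\s*')
# Picks the encoding name out of a declaration.
ENCODING = re.compile(r'encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def strip_declaration(text):
    """Option 1: drop the declaration so only the XML body remains as text."""
    return XML_DECL.sub(u'', text, count=1)


def to_declared_bytes(text, default='utf-8'):
    """Option 2: encode the text with the encoding its declaration names."""
    decl = XML_DECL.match(text)
    found = ENCODING.search(decl.group(0)) if decl else None
    return text.encode(found.group(1) if found else default)


# Hypothetical reply containing a character outside the ASCII range.
reply = u'<?xml version="1.0" encoding="utf-8" ?><root><name>Bj\xf8rn</name></root>'

body = strip_declaration(reply)   # text without any declaration
raw = to_declared_bytes(reply)    # UTF-8 bytes, declaration intact
```

Either result should be acceptable to `lxml.etree.fromstring`: `body` is text with no encoding declaration, and `raw` is a bytestring whose declaration matches its actual encoding. Because the second option re-encodes the whole string, non-ASCII characters survive as long as the declared encoding can represent them.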
Currently, the data I’m receiving from the webservice is within the ASCII range, so either way will work, but both feel very much like ugly hacks to me, and I’m not quite sure what would happen if I started receiving data that needs a wider range of Unicode characters.
Any good ideas? I can’t imagine I’m the first one in this position…