I have written a small function, which uses ElementTree and xpath to extract the text contents of certain elements in an xml file:
#!/usr/bin/env python2.5
import doctest
from xml.etree import ElementTree
from StringIO import StringIO
def parse_xml_etree(sin, xpath):
"""
Takes as input a stream containing XML and an XPath expression.
Applies the XPath expression to the XML and returns a generator
yielding the text contents of each element returned.
>>> parse_xml_etree(
... StringIO('<test><elem1>one</elem1><elem2>two</elem2></test>'),
... '//elem1').next()
'one'
>>> parse_xml_etree(
... StringIO('<test><elem1>one</elem1><elem2>two</elem2></test>'),
... '//elem2').next()
'two'
>>> parse_xml_etree(
... StringIO('<test><null>�</null><elem3>three</elem3></test>'),
... '//elem2').next()
'three'
"""
tree = ElementTree.parse(sin)
for element in tree.findall(xpath):
yield element.text
if __name__ == '__main__':
doctest.testmod(verbose=True)
The third test fails with the following exception:
ExpatError: reference to invalid character number: line 1, column 13
Is the entity illegal XML? Regardless whether it is or not, the files I want to parse contain it, and I need some way to parse them. Any suggestions for another parser than Expat, or settings for Expat, that would allow me to do that?