Encoding in python with lxml - complex solution
- by Vojtech R.
Hi,
I need to download and parse webpage with lxml and build UTF-8 xml output. I thing schema in pseudocode is more illustrative:
from lxml import etree
webfile = urllib2.urlopen(url)
root = etree.parse(webfile.read(), parser=etree.HTMLParser(recover=True))
txt = my_process_text(etree.tostring(root.xpath('/html/body'), encoding=utf8))
output = etree.Element("out")
output.text = txt
outputfile.write(etree.tostring(output, encoding=utf8))
So webfile can be in any encoding (lxml should handle this). Outputfile have to be in utf-8. I'm not sure where to use encoding/coding. Is this schema ok? (I cant find good tutorial about lxml and encoding, but I can find many problems with this...) I need robust approved solution so I ask you seniors. 
Many thanks