Encoding in Python with lxml - complex solution
Posted
by Vojtech R.
on Stack Overflow
Published on 2010-04-21T21:30:02Z
Hi,
I need to download and parse a web page with lxml and build UTF-8 XML output. I think a schema in pseudocode is more illustrative:
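import urllib2
from lxml import etree

webfile = urllib2.urlopen(url)
# Pass the file object (raw bytes), not a decoded string, so the parser can handle the encoding
root = etree.parse(webfile, parser=etree.HTMLParser(recover=True))
# xpath() returns a list, so take the first <body> element
body = root.xpath('/html/body')[0]
# Serialize to a unicode string; my_process_text is assumed to accept and return unicode
txt = my_process_text(etree.tostring(body, encoding=unicode))
output = etree.Element("out")
output.text = txt
# Encode to UTF-8 only once, when writing the output
outputfile.write(etree.tostring(output, encoding='utf-8'))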
So webfile can be in any encoding (lxml should handle this). The output file has to be in UTF-8. I'm not sure where to specify the encoding. Is this schema OK? (I can't find a good tutorial about lxml and encoding, but I can find many reports of problems with it...) I need a robust, proven solution, so I'm asking you seniors.
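For reference, here is a minimal sketch of the same schema in Python 3, assuming the page has already been downloaded as raw bytes (for example with `urllib.request.urlopen(url).read()`). The function name `build_utf8_xml` and the `process_text` parameter are hypothetical stand-ins for the steps above, not part of lxml. The idea is to give the parser bytes, work on text in the middle, and encode to UTF-8 only once at the end.

```python
from lxml import etree


def build_utf8_xml(raw_html, process_text=lambda text: text):
    """Parse raw HTML bytes and return a UTF-8 encoded XML document.

    build_utf8_xml and process_text are hypothetical names used only
    for this illustration.
    """
    # Feed bytes, not decoded text, so the parser can use the charset
    # declared in the page itself (for example in a <meta> tag).
    parser = etree.HTMLParser(recover=True)
    root = etree.fromstring(raw_html, parser=parser)

    # find() returns a single element or None; xpath() would return a list.
    body = root.find("body")
    if body is None:
        raise ValueError("no <body> element found")

    # encoding="unicode" makes tostring() return a str instead of bytes,
    # so the processing step works on text rather than on encoded data.
    body_markup = etree.tostring(body, encoding="unicode")

    output = etree.Element("out")
    output.text = process_text(body_markup)

    # Encode exactly once, at the output boundary.
    return etree.tostring(output, encoding="utf-8", xml_declaration=True)
```

The returned value is bytes, so the output file would be opened in binary mode (`open(path, "wb")`). The sketch does not look at the charset in the HTTP `Content-Type` header; if that matters, one option is to pass it explicitly via `etree.HTMLParser(encoding=...)`.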
Many thanks
© Stack Overflow or respective owner