Encoding in Python with lxml - complex solution

Posted by Vojtech R. on Stack Overflow
Published on 2010-04-21T21:30:02Z Indexed on 2010/04/21 21:33 UTC

Hi,

I need to download and parse a web page with lxml and build UTF-8 XML output. I think a schema in pseudocode is more illustrative:

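from urllib.request import urlopen  # urllib2.urlopen on Python 2
from lxml import etree

# url, my_process_text and outputfile are placeholders defined elsewhere
webfile = urlopen(url)
# Pass the file-like object itself (not webfile.read()): etree.parse() expects
# a file or filename, and the parser then sees the raw bytes of the page
root = etree.parse(webfile, parser=etree.HTMLParser(recover=True))

# xpath() returns a list of elements, so take the first match
body = root.xpath('/html/body')[0]
txt = my_process_text(etree.tostring(body, encoding="utf-8"))

output = etree.Element("out")
output.text = txt

# tostring() returns bytes here, so outputfile must be opened in binary mode
outputfile.write(etree.tostring(output, encoding="utf-8"))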

So the web page can be in any encoding (lxml should handle this), while the output file has to be in UTF-8. I'm not sure where to specify the encoding. Is this schema OK? (I can't find a good tutorial about lxml and encoding, but I can find many reports of problems with it...) I need a robust, proven solution, so I'm asking you seniors.
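To make the question concrete, here is a minimal self-contained sketch of the flow I have in mind, run on an in-memory page instead of a download. The sample page, the build_output name and the whitespace-collapsing body of my_process_text are placeholders made up for illustration. The idea, as far as I understand lxml, is to hand the parser raw bytes so it can read the charset declared in the page, work with decoded text in between, and encode to UTF-8 only once, when serialising:

```python
from lxml import etree


def my_process_text(text):
    # Hypothetical stand-in for the real processing step: collapse whitespace.
    return " ".join(text.split())


def build_output(raw_html):
    """Parse raw HTML bytes and return a UTF-8 encoded XML document (bytes)."""
    # Feed bytes, not decoded text, so the parser can use the charset
    # declared in the page's <meta> tag.
    root = etree.fromstring(raw_html, parser=etree.HTMLParser(recover=True))
    body = root.xpath("/html/body")[0]  # xpath() returns a list
    # encoding="unicode" returns a decoded str instead of encoded bytes.
    txt = my_process_text(etree.tostring(body, encoding="unicode"))
    output = etree.Element("out")
    output.text = txt  # markup characters in txt are escaped on output
    # Encode exactly once, at serialisation time.
    return etree.tostring(output, encoding="utf-8", xml_declaration=True)


# A page declared as ISO-8859-1, standing in for the downloaded bytes.
page = (
    u'<html><head><meta http-equiv="Content-Type" '
    u'content="text/html; charset=iso-8859-1"></head>'
    u"<body><p>na\u00efve   caf\u00e9</p></body></html>"
).encode("iso-8859-1")

result = build_output(page)
```

In this sketch result is a bytes object, so it would have to be written to a file opened in binary mode. Whether my_process_text should receive decoded text (as here) or UTF-8 bytes (as in my schema above) is exactly the part I am unsure about.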

Many thanks
