Why is python decode replacing more than the invalid bytes from an encoded string?
Posted by dangra on Stack Overflow
Published on 2010-03-30T17:33:46Z
Indexed on 2010/03/30 17:53 UTC
Decoding an HTML page that contains invalid UTF-8 gives different results in Python, Firefox and Chrome.
The invalidly encoded fragment from the test page looks like 'PREFIX\xe3\xabSUFFIX':
>>> fragment = 'PREFIX\xe3\xabSUFFIX'
>>> fragment.decode('utf-8', 'strict')
...
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-8: invalid data
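The byte range that the decoder blames is available on the exception itself, which helps when comparing replacement policies. The following is a minimal sketch in Python 3 syntax (the session above is Python 2); the helper name describe_decode_error is made up for illustration, and the reported range may differ between interpreter versions.

```python
# Python 3 syntax; the original session in this post is Python 2.
fragment = b"PREFIX\xe3\xabSUFFIX"


def describe_decode_error(data, encoding="utf-8"):
    """Return (start, end, offending_bytes, reason) for the first decoding
    error, or None if the data decodes cleanly.

    Hypothetical helper, not part of the standard library.
    """
    try:
        data.decode(encoding, "strict")
    except UnicodeDecodeError as exc:
        # exc.object is the full input; exc.start/exc.end delimit the
        # slice the decoder considers invalid (end is exclusive).
        return exc.start, exc.end, exc.object[exc.start:exc.end], exc.reason
    return None


info = describe_decode_error(fragment)
print(info)
```

If the reported slice includes the S, any handler that resumes at exc.end will skip it.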
What follows is a summary of the replacement policies used by Python, Firefox and Chrome to handle decoding errors. Note how the three differ, and especially how the Python builtin removes the valid S
(plus the invalid sequence of bytes).
by Python
The builtin replace error handler replaces the invalid \xe3\xab plus the S from SUFFIX with U+FFFD:
>>> fragment.decode('utf-8', 'replace')
u'PREFIX\ufffdUFFIX'
>>> print _
PREFIX?UFFIX
A Python implementation of the builtin replace error handler looks like:
>>> python_replace = lambda exc: (u'\ufffd', exc.end)
As expected, this gives the same result as the builtin:
>>> import codecs
>>> codecs.register_error('python_replace', python_replace)
>>> fragment.decode('utf-8', 'python_replace')
u'PREFIX\ufffdUFFIX'
>>> print _
PREFIX?UFFIX
by Firefox
Firefox replaces each invalid byte with U+FFFD:
>>> firefox_replace = lambda exc: (u'\ufffd', exc.start+1)
>>> codecs.register_error('firefox_replace', firefox_replace)
>>> fragment.decode('utf-8', 'firefox_replace')
u'PREFIX\ufffd\ufffdSUFFIX'
>>> print _
PREFIX??SUFFIX
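Because this handler looks only at exc.start, its result does not depend on how many bytes the decoder reports as invalid. Here is a minimal Python 3 sketch of the same per-byte policy; the handler name per_byte_replace is an assumption made for this example, and it mirrors the behaviour described above rather than Firefox's actual code.

```python
import codecs


def per_byte_replace(exc):
    """Emit one U+FFFD and resume decoding one byte after the error start."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only handles decoding errors
    # Ignore exc.end entirely: advance by exactly one byte each time.
    return ("\ufffd", exc.start + 1)


# "per_byte_replace" is a hypothetical handler name chosen for this example.
codecs.register_error("per_byte_replace", per_byte_replace)

fragment = b"PREFIX\xe3\xabSUFFIX"
decoded = fragment.decode("utf-8", "per_byte_replace")
print(ascii(decoded))
```

Each of the two invalid bytes triggers the handler once, so the S is never consumed.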
by Chrome
Chrome replaces each invalid sequence of bytes with U+FFFD:
>>> chrome_replace = lambda exc: (u'\ufffd', exc.end-1)
>>> codecs.register_error('chrome_replace', chrome_replace)
>>> fragment.decode('utf-8', 'chrome_replace')
u'PREFIX\ufffdSUFFIX'
>>> print _
PREFIX?SUFFIX
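The exc.end-1 offset works for this fragment but relies on the range the decoder happens to report. A hedged alternative, sketched in Python 3, finds the end of the invalid sequence itself by skipping the continuation bytes (0x80 to 0xBF) that follow the offending byte. The name sequence_replace is hypothetical, and this only approximates the one-replacement-per-sequence behaviour described above; it is not Chrome's actual algorithm.

```python
import codecs


def sequence_replace(exc):
    """Emit one U+FFFD for the offending byte plus any continuation bytes
    (0b10xxxxxx) that directly follow it, then resume after them."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only handles decoding errors
    data = exc.object
    pos = exc.start + 1
    # Indexing bytes yields ints in Python 3.
    while pos < len(data) and 0x80 <= data[pos] <= 0xBF:
        pos += 1
    return ("\ufffd", pos)


# "sequence_replace" is a hypothetical handler name chosen for this example.
codecs.register_error("sequence_replace", sequence_replace)

fragment = b"PREFIX\xe3\xabSUFFIX"
decoded = fragment.decode("utf-8", "sequence_replace")
print(ascii(decoded))
```

A run of stray continuation bytes collapses into a single U+FFFD under this sketch, which may or may not match what a given browser does.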
The main question is why the builtin replace error handler for str.decode is removing the S from SUFFIX. Also, is there an officially recommended Unicode way of handling decoding replacements?
© Stack Overflow or respective owner