Why is python decode replacing more than the invalid bytes from an encoded string?
- by dangra
Decoding an invalidly encoded UTF-8 HTML page gives different results in Python, Firefox and Chrome.
The invalidly encoded fragment from the test page looks like 'PREFIX\xe3\xabSUFFIX':
>>> fragment = 'PREFIX\xe3\xabSUFFIX'
>>> fragment.decode('utf-8', 'strict')
...
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-8: invalid data
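To see exactly which bytes the decoder considers invalid, the exception's `start`, `end` and `object` attributes can be inspected. This is a minimal sketch; `invalid_span` is a hypothetical helper name, and the reported `end` may differ between interpreter versions (the traceback above reports positions 6-8, i.e. an exclusive end of 9).

```python
# Inspect the span that the UTF-8 decoder reports as invalid.
fragment = b'PREFIX\xe3\xabSUFFIX'  # bytes literal, accepted by Python 2.6+ and 3


def invalid_span(data, encoding='utf-8'):
    """Return (start, end, offending_bytes) for the first decode error, or None.

    Hypothetical helper, not part of the standard library.
    """
    try:
        data.decode(encoding, 'strict')
    except UnicodeDecodeError as exc:
        # exc.end is exclusive, so the slice below is exactly the reported span.
        return exc.start, exc.end, exc.object[exc.start:exc.end]
    return None


span = invalid_span(fragment)
print(span)
```

If the reported span includes the `S`, then any handler that resumes at `exc.end` will necessarily swallow it.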
What follows is a summary of the replacement policies that Python, Firefox and Chrome use to handle decoding errors. Note how the three differ, and especially how the Python builtin removes the valid S in addition to the invalid byte sequence.
by Python
The builtin replace error handler replaces the invalid \xe3\xab
plus the S from SUFFIX with a single U+FFFD:
>>> fragment.decode('utf-8', 'replace')
u'PREFIX\ufffdUFFIX'
>>> print _
PREFIX?UFFIX
A Python implementation of the builtin replace error handler looks like:
>>> python_replace = lambda exc: (u'\ufffd', exc.end)
As expected, trying this gives the same result as the builtin:
>>> import codecs
>>> codecs.register_error('python_replace', python_replace)
>>> fragment.decode('utf-8', 'python_replace')
u'PREFIX\ufffdUFFIX'
>>> print _
PREFIX?UFFIX
by Firefox
Firefox replaces each invalid byte with U+FFFD:
>>> firefox_replace = lambda exc: (u'\ufffd', exc.start+1)
>>> codecs.register_error('firefox_replace', firefox_replace)
>>> fragment.decode('utf-8', 'firefox_replace')
u'PREFIX\ufffd\ufffdSUFFIX'
>>> print _
PREFIX??SUFFIX
by Chrome
Chrome replaces each invalid sequence of bytes with a single U+FFFD:
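>>> # max() guards against returning exc.start for a one-byte span, which would never advance
>>> chrome_replace = lambda exc: (u'\ufffd', max(exc.start + 1, exc.end - 1))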
>>> codecs.register_error('chrome_replace', chrome_replace)
>>> fragment.decode('utf-8', 'chrome_replace')
u'PREFIX\ufffdSUFFIX'
>>> print _
PREFIX?SUFFIX
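The three policies differ only in how many bytes each U+FFFD stands for. The following is a minimal sketch that reproduces the three results without relying on what any particular decoder reports; `replace_span` is a hypothetical helper, and the spans are hard-coded for this fragment rather than detected.

```python
REPLACEMENT = u'\ufffd'
fragment = b'PREFIX\xe3\xabSUFFIX'  # \xe3 is at index 6, \xab at 7, S at 8


def replace_span(data, start, end, per_byte=False):
    """Decode data, substituting U+FFFD for the bytes in data[start:end].

    Hypothetical helper: the caller supplies the span to replace, and the
    bytes outside the span are assumed to be valid UTF-8.
    """
    count = (end - start) if per_byte else 1
    return (data[:start].decode('utf-8')
            + REPLACEMENT * count
            + data[end:].decode('utf-8'))


# Span 6..9 includes the valid S, as in the builtin result shown above.
python_like = replace_span(fragment, 6, 9)
# Span 6..8 covers only the two invalid bytes, one U+FFFD per byte.
firefox_like = replace_span(fragment, 6, 8, per_byte=True)
# Same span, but a single U+FFFD for the whole invalid sequence.
chrome_like = replace_span(fragment, 6, 8)

for name, text in [('python', python_like),
                   ('firefox', firefox_like),
                   ('chrome', chrome_like)]:
    print(name + ': ' + repr(text))
```

Seen this way, the Python result differs from Chrome's only in where the replaced span ends: one byte further, over the `S`.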
The main question is why the builtin replace error handler for str.decode removes
the S from SUFFIX. Also, is there an official Unicode-recommended way
of handling decoding replacements?