Why is Python's decode replacing more than the invalid bytes of an encoded string?

Posted by dangra on Stack Overflow
Published on 2010-03-30T17:33:46Z Indexed on 2010/03/30 17:53 UTC
Filed under:

scraping

Decoding an HTML page that contains invalid UTF-8 gives different results in Python, Firefox and Chrome.

The invalidly encoded fragment from the test page looks like 'PREFIX\xe3\xabSUFFIX'

>>> fragment = 'PREFIX\xe3\xabSUFFIX'
>>> fragment.decode('utf-8', 'strict')
...
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-8: invalid data
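
The span in that message can be checked against the fragment itself. A minimal sketch, written in Python 3 syntax (an assumption; the session above is Python 2, and the names start, last and reported are only illustrative), that slices the positions 6-8 quoted in the error:

```python
# The fragment as a bytes literal (Python 3 syntax).
fragment = b"PREFIX\xe3\xabSUFFIX"

# Positions copied from the error message above: 6 through 8, inclusive.
start, last = 6, 8
reported = fragment[start:last + 1]

# The reported span covers both invalid bytes and the following valid byte.
print(reported)
print(chr(fragment[last]))
```

The last position in the reported span is the S of SUFFIX, so the decoder is already treating that byte as part of the invalid data before any error handler runs.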

What follows is a summary of the replacement policies that Python, Firefox and Chrome use to handle decoding errors. Note how the three differ, and especially how the Python builtin removes the valid S in addition to the invalid sequence of bytes.

by Python

The builtin replace error handler replaces the invalid \xe3\xab plus the S from SUFFIX with a single U+FFFD

>>> fragment.decode('utf-8', 'replace')
u'PREFIX\ufffdUFFIX'
>>> print _
PREFIX?UFFIX

The builtin replace error handler behaves like this Python implementation:

>>> python_replace = lambda exc: (u'\ufffd', exc.end)

As expected, registering it gives the same result as the builtin:

>>> import codecs
>>> codecs.register_error('python_replace', python_replace)
>>> fragment.decode('utf-8', 'python_replace')
u'PREFIX\ufffdUFFIX'
>>> print _
PREFIX?UFFIX
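
The handler contract is easier to see when the handler is called by hand. The sketch below uses Python 3 syntax (an assumption; the session above is Python 2) and builds the exception manually with the span taken from the traceback, start 6 and end 9, instead of relying on what any particular decoder reports. FRAGMENT, resume_at and remaining are illustrative names.

```python
FRAGMENT = b"PREFIX\xe3\xabSUFFIX"

def python_replace(exc):
    """Return (replacement, resume position), the pair a codecs error handler must return."""
    return ("\ufffd", exc.end)

# Hand-built error with the span from the traceback: bytes 6-8, so end is 9.
exc = UnicodeDecodeError("utf-8", FRAGMENT, 6, 9, "invalid data")
replacement, resume_at = python_replace(exc)

# Bytes from exc.start up to resume_at are dropped in favour of the replacement;
# decoding continues with whatever is left after that position.
remaining = exc.object[resume_at:]
print(resume_at, remaining)
```

With an end of 9, the handler resumes after the S, which is why only UFFIX survives.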

by Firefox

Firefox replaces each invalid byte with U+FFFD

>>> firefox_replace = lambda exc: (u'\ufffd', exc.start+1)
>>> codecs.register_error('firefox_replace', firefox_replace)
>>> fragment.decode('utf-8', 'firefox_replace')
u'PREFIX\ufffd\ufffdSUFFIX'
>>> print _
PREFIX??SUFFIX

by Chrome

Chrome replaces each invalid sequence of bytes with U+FFFD

>>> chrome_replace = lambda exc: (u'\ufffd', exc.end-1)
>>> codecs.register_error('chrome_replace', chrome_replace)
>>> fragment.decode('utf-8', 'chrome_replace')
u'PREFIX\ufffdSUFFIX'
>>> print _
PREFIX?SUFFIX
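
The three policies can also be compared without the codec machinery. The sketch below is a toy model in Python 3 syntax, not the actual CPython, Firefox or Chrome code: it assumes the decoder reports an error span as long as the sequence announced by the lead byte, which matches the 6-8 span in the traceback. The helpers expected_length and decode_with and the POLICIES table are hypothetical names introduced for illustration.

```python
REPLACEMENT = "\ufffd"

def expected_length(lead):
    """Sequence length announced by a UTF-8 lead byte, or 0 if it is not a lead byte."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    return 0  # continuation byte or invalid lead byte

def decode_with(data, resume):
    """Decode UTF-8, calling resume(start, end) to choose where to continue after an error."""
    out = []
    pos = 0
    while pos < len(data):
        # Assumed error span: the full length the lead byte announces (at least 1).
        span = max(expected_length(data[pos]), 1)
        try:
            out.append(data[pos:pos + span].decode("utf-8"))
            pos += span
        except UnicodeDecodeError:
            out.append(REPLACEMENT)
            # Always move forward at least one byte so a policy cannot loop forever.
            pos = max(resume(pos, pos + span), pos + 1)
    return "".join(out)

# Resume rules copied from the three handlers in the question.
POLICIES = {
    "python": lambda start, end: end,
    "firefox": lambda start, end: start + 1,
    "chrome": lambda start, end: end - 1,
}

fragment = b"PREFIX\xe3\xabSUFFIX"
results = {name: decode_with(fragment, rule) for name, rule in POLICIES.items()}
for name, text in results.items():
    print(name, ascii(text))
```

Under that assumption the model reproduces the three outputs shown above. The only difference between the policies is the position at which decoding resumes.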

The main question is why the builtin replace error handler for str.decode removes the S from SUFFIX. Also, is there an officially recommended way in Unicode to handle decoding replacements?
