What is the proper way to URL encode Unicode characters?
Posted by Josh Gibson on Stack Overflow
Published on 2009-05-26T21:18:56Z
I know of the non-standard %uxxxx scheme but that doesn't seem like a wise choice since the scheme has been rejected by the W3C.
Some interesting examples:
The heart character. If I type this into my browser:
http://www.google.com/search?q=♥
then copy and paste it, I see this URL:
http://www.google.com/search?q=%E2%99%A5
which makes it seem like Firefox (or Safari) is doing this:
urllib.quote_plus(x.encode("latin-1"))
'%E2%99%A5'
which makes sense, except for things that can't be encoded in Latin-1, like the triple dot character.
…
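A quick way to compare the two encodings side by side is sketched below. It assumes Python 3, where `quote_plus` lives in `urllib.parse` rather than in `urllib` as in the Python 2 snippets in this post; the helper name `percent_encode` is made up for illustration.

```python
from urllib.parse import quote_plus

def percent_encode(text, encoding):
    """Percent-encode text after encoding it with the given codec.

    Returns None when the codec cannot represent the text.
    (percent_encode is a hypothetical helper, not a library function.)
    """
    try:
        return quote_plus(text.encode(encoding))
    except UnicodeEncodeError:
        return None

heart = "\u2665"     # the heart character
ellipsis = "\u2026"  # the triple dot character

for ch in (heart, ellipsis):
    print(repr(ch),
          "utf-8:", percent_encode(ch, "utf-8"),
          "latin-1:", percent_encode(ch, "latin-1"))
```

Under these assumptions, neither character can be represented in Latin-1 (the helper returns None for both), while the UTF-8 results match the URLs copied back from the browser.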
If I type the URL
http://www.google.com/search?q=…
into my browser then copy and paste, I get
http://www.google.com/search?q=%E2%80%A6
back, which seems to be the result of doing
urllib.quote_plus(x.encode("utf-8"))
which makes sense since … can't be encoded with Latin-1.
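The whole query string can be built the same way. This is a minimal sketch, assuming Python 3, where `urllib.parse.urlencode` percent-encodes `str` values as UTF-8 unless told otherwise; `build_search_url` is an illustrative name only, not something from the original post.

```python
from urllib.parse import urlencode

def build_search_url(query):
    # Hypothetical helper: urlencode percent-encodes str values
    # as UTF-8 by default, so no explicit .encode() call is needed.
    return "http://www.google.com/search?" + urlencode({"q": query})

url = build_search_url("\u2026")  # the triple dot character
print(url)
```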
But then it's not clear to me how the browser knows whether to decode with UTF-8 or Latin-1, since this seems to be ambiguous:
In [67]: u"…".encode('utf-8').decode('latin-1')
Out[67]: u'\xc3\xa2\xc2\x80\xc2\xa6'
decodes without an error, so I don't see how the browser figures out whether to decode that with UTF-8 or Latin-1.
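The ambiguity can be seen directly on the percent-decoded bytes. The sketch below assumes Python 3 and uses `urllib.parse.unquote_to_bytes`; it only illustrates that both decodes succeed, not how any particular browser or server chooses between them.

```python
from urllib.parse import unquote_to_bytes

# Undo the percent-encoding only; no character decoding happens yet.
raw = unquote_to_bytes("%E2%80%A6")

as_utf8 = raw.decode("utf-8")      # interprets the three bytes as one character
as_latin1 = raw.decode("latin-1")  # interprets each byte as its own character

print(len(as_utf8), repr(as_utf8))
print(len(as_latin1), repr(as_latin1))
```

Latin-1 assigns a character to every byte value, so decoding with it never fails; the byte sequence alone does not reveal which encoding was intended.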
What's the right thing to be doing with the special characters I need to deal with?