IDN-aware tools to encode/decode a human-readable IRI to/from a valid URI
- by Denis Otkidach
Let's assume a user enters the address of some resource and we need to translate it to:
<a href="valid URI here">human readable form</a>
The HTML4 specification refers to RFC 3986, which allows only ASCII alphanumeric characters and the dash in the host part; all non-ASCII characters in other parts should be percent-encoded. That's what I want to put in the href attribute so the link works properly in all browsers. An IDN should be encoded with Punycode.
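To make this direction concrete, here is a rough sketch using only the Python standard library. The names `iri_to_uri` and `_SAFE` are made up for illustration. It assumes the non-host parts should be percent-encoded as UTF-8, it relies on the built-in `idna` codec (which implements IDNA 2003), and it silently drops any userinfo, so it is not a complete solution:

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

# Reserved characters to leave alone; '%' keeps existing escapes intact.
_SAFE = "%/:@!$&'()*+,;=?"


def iri_to_uri(iri):
    """Hypothetical helper: turn a human-entered IRI into an ASCII-only URI."""
    parts = urlsplit(iri)
    # The host may arrive percent-encoded (RFC 3987 allows that), so unquote first.
    host = unquote(parts.hostname or "")
    if host:
        # Built-in codec: converts each non-ASCII label to its Punycode form.
        host = host.encode("idna").decode("ascii")
    netloc = host
    if parts.port is not None:
        netloc += ":%d" % parts.port
    # Note: userinfo (user:password@) is dropped in this sketch.
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=_SAFE),      # non-ASCII becomes UTF-8 %XX
        quote(parts.query, safe=_SAFE),
        quote(parts.fragment, safe=_SAFE),
    ))
```

Input that is already a plain ASCII URI should pass through unchanged, since `%` is treated as safe and ASCII host labels are left as they are.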
The HTML5 draft refers to RFC 3987, which also allows percent-encoded Unicode characters in the host part and a large subset of Unicode in both the host and other parts without encoding them. The user may enter the address in any of these forms. To provide a human-readable form of it, I need to decode all printable characters. Note that some parts of the address might not correspond to valid UTF-8 sequences, usually when the target site uses some other character encoding.
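The decoding direction could be sketched roughly like this, again with only the standard library. `uri_to_iri`, `_readable` and `_ESCAPES` are hypothetical names. The sketch assumes that a run of percent-escapes should be decoded only when the whole run is valid UTF-8, and that ASCII and non-printable characters should stay escaped so reserved delimiters such as `%2F` keep their meaning:

```python
import re
from urllib.parse import quote, unquote, unquote_to_bytes, urlsplit, urlunsplit

_ESCAPES = re.compile(r"(?:%[0-9A-Fa-f]{2})+")  # a run of %XX escapes


def _readable(text):
    """Decode escape runs that are valid UTF-8; leave everything else as is."""
    def repl(match):
        raw = unquote_to_bytes(match.group(0))
        try:
            chars = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Probably another site encoding (e.g. windows-1251): keep escapes.
            return match.group(0)
        # Show printable non-ASCII characters; re-escape the rest.
        return "".join(
            c if ord(c) > 127 and c.isprintable() else quote(c, safe="")
            for c in chars
        )
    return _ESCAPES.sub(repl, text)


def uri_to_iri(uri):
    """Hypothetical helper: produce a human-readable form of a URI or IRI."""
    parts = urlsplit(uri)
    host = unquote(parts.hostname or "")
    if host:
        # Round-trip through the built-in IDNA 2003 codec to get Unicode labels.
        host = host.encode("idna").decode("idna")
    netloc = host
    if parts.port is not None:
        netloc += ":%d" % parts.port
    # Note: userinfo is dropped in this sketch.
    return urlunsplit((parts.scheme, netloc, _readable(parts.path),
                       _readable(parts.query), _readable(parts.fragment)))
```

One limitation of this approach: if a valid UTF-8 sequence is immediately followed by invalid bytes in the same run, the whole run stays escaped.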
An example of what I'd like to get:
<a href="http://xn--80aswg.xn--p1ai/%D0%BF%D1%83%D1%82%D1%8C?%D0%B7%D0%B0%D0%BF%D1%80%D0%BE%D1%81">
http://сайт.рф/путь?запрос</a>
Are there any tools to solve these tasks? I'm especially interested in libraries for Python and JavaScript.