IDN-aware tools to encode/decode human-readable IRIs to/from valid URIs

Posted by Denis Otkidach on Stack Overflow
Published on 2010-05-14T09:19:30Z


Let's assume a user enters the address of some resource and we need to translate it to:

<a href="valid URI here">human readable form</a>

The HTML4 specification refers to RFC 3986, which allows only ASCII alphanumeric characters and the hyphen in the host part; all non-ASCII characters in the other parts must be percent-encoded. That's what I want to put in the href attribute so that the link works properly in all browsers. An IDN should be encoded with Punycode.
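For the encoding direction, a minimal Python 3 sketch using only the standard library might look like the following. The names `iri_to_uri` and `_SAFE` are illustrative assumptions, not part of any library. The built-in `idna` codec implements the older IDNA 2003 rules, and this sketch drops any userinfo in the authority for brevity.

```python
from urllib.parse import urlsplit, urlunsplit, quote

# Characters left untouched by percent-encoding. "%" is kept so that
# escapes already present in the input are not encoded a second time
# (assumption: a stray "%" in the input is left as it is).
_SAFE = "/%:@!$&'()*+,;="


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: turn a human-entered IRI into an ASCII-only URI."""
    parts = urlsplit(iri)

    # Host: Punycode every non-ASCII label (IDNA 2003 via the stdlib codec).
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += f":{parts.port}"

    # Other parts: percent-encode the UTF-8 bytes of unsafe characters.
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=_SAFE),
        quote(parts.query, safe=_SAFE + "?"),
        quote(parts.fragment, safe=_SAFE + "?"),
    ))
```

With this sketch, `iri_to_uri("http://сайт.рф/путь?запрос")` produces `http://xn--80aswg.xn--p1ai/%D0%BF%D1%83%D1%82%D1%8C?%D0%B7%D0%B0%D0%BF%D1%80%D0%BE%D1%81`, and a URI that is already encoded passes through unchanged.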

The HTML5 draft refers to RFC 3987, which also allows percent-encoded Unicode characters in the host part, and a large subset of Unicode in both the host and the other parts without encoding. The user may enter the address in any of these forms. To provide a human-readable form of it, I need to decode all printable characters. Note that some parts of the address might not correspond to valid UTF-8 sequences, usually when the target site uses some other character encoding.
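The decoding direction could be sketched in Python 3 along these lines, again with the standard library only. The names `uri_to_iri`, `_ESCAPES` and `_decode_run` are hypothetical. The sketch decodes a run of percent-escapes only when the whole run is valid UTF-8, keeps ASCII reserved characters and non-printable characters escaped, and does not handle percent-encoded host names or userinfo.

```python
import re
from urllib.parse import urlsplit, urlunsplit, unquote_to_bytes, quote

# One or more consecutive percent-escapes, e.g. "%D0%BF".
_ESCAPES = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match: "re.Match[str]") -> str:
    run = match.group(0)
    try:
        text = unquote_to_bytes(run).decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8 (for example a site using a legacy encoding):
        # leave the escapes exactly as they were.
        return run
    # Show printable non-ASCII characters; re-escape everything else so
    # that reserved characters such as "/" or "%" keep their meaning.
    return "".join(
        ch if ord(ch) > 127 and ch.isprintable() else quote(ch, safe="")
        for ch in text
    )


def uri_to_iri(uri: str) -> str:
    """Hypothetical helper: produce a human-readable form of a URI."""
    parts = urlsplit(uri)
    host = parts.hostname or ""
    try:
        # Punycode labels ("xn--...") back to Unicode (IDNA 2003 codec).
        host = host.encode("ascii").decode("idna")
    except UnicodeError:
        pass  # host is already Unicode or not valid IDNA: show it as is
    netloc = host + (f":{parts.port}" if parts.port is not None else "")
    return urlunsplit((
        parts.scheme,
        netloc,
        _ESCAPES.sub(_decode_run, parts.path),
        _ESCAPES.sub(_decode_run, parts.query),
        _ESCAPES.sub(_decode_run, parts.fragment),
    ))
```

A path such as `/%EF%F3%F2%FC` (bytes that are not valid UTF-8) is returned untouched, which matches the caveat above about sites that use other character encodings.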

An example of what I'd like to get:

<a href="http://xn--80aswg.xn--p1ai/%D0%BF%D1%83%D1%82%D1%8C?%D0%B7%D0%B0%D0%BF%D1%80%D0%BE%D1%81">
http://сайт.рф/путь?запрос</a>

Are there any tools to solve these tasks? I'm especially interested in libraries for Python and JavaScript.

© Stack Overflow or respective owner
