Allowed Unicode characters in IDN host labels

Posted by Roland Franssen on Stack Overflow
Published on 2010-05-17T19:10:18Z Indexed on 2010/05/17 19:20 UTC

Hi all,

I'm currently working on a "proper" URI validator, and right now it all comes down to hostname validation; the rest isn't that tricky.

I'm stuck on IDN hostname labels (i.e. labels containing Unicode; any Punycode-encoded strings have already been decoded at this point).
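For illustration, the decoding step mentioned above might look roughly like this in Python. This is only a sketch: `decode_label` is a hypothetical helper name, and it relies on the standard library's `idna` codec, which implements the older IDNA 2003 rules rather than IDNA 2008.

```python
def decode_label(label: str) -> str:
    """Return the Unicode form of a single host label.

    ACE labels ("xn--...") are Punycode-decoded; labels that already
    contain non-ASCII characters are returned unchanged.
    """
    if not label.isascii():
        # Already Unicode, nothing to decode.
        return label
    # The built-in "idna" codec follows IDNA 2003 (RFC 3490).
    return label.encode("ascii").decode("idna")


if __name__ == "__main__":
    print(decode_label("xn--mnchen-3ya"))
    print(decode_label("example"))
```

Plain ASCII labels pass through untouched, so the helper can be applied to every label before validation.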

My first idea was basically one regex for TLDs that do not support IDN and another for those that do (http://www.mozilla.org/projects/security/tld-idn-policy-list.html (?)).

Respectively: ^[a-zA-Z0-9-]+$ and ^[a-zA-Z0-9\p{L}-]+$
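As a rough sketch of those two checks in Python: the standard `re` module does not support `\p{L}`, so the second check approximates it with `unicodedata.category`, treating every code point whose general category starts with "L" as a letter. The function names `is_ascii_label` and `is_idn_label` are made up for this example, and neither check enforces length limits or hyphen placement.

```python
import re
import unicodedata

# Labels under TLDs without IDN support: ASCII letters, digits, hyphen.
ASCII_LABEL = re.compile(r"[a-zA-Z0-9-]+")


def is_ascii_label(label: str) -> bool:
    """Equivalent of ^[a-zA-Z0-9-]+$ (fullmatch also rejects a trailing newline)."""
    return ASCII_LABEL.fullmatch(label) is not None


def is_idn_label(label: str) -> bool:
    """Approximation of ^[a-zA-Z0-9\\p{L}-]+$ without \\p{L} support.

    ASCII letters already fall under the "L*" categories, so only
    digits and the hyphen need to be listed explicitly.
    """
    if not label:
        return False
    return all(
        ch in "0123456789-" or unicodedata.category(ch).startswith("L")
        for ch in label
    )


if __name__ == "__main__":
    for candidate in ("example", "m\u00fcnchen", "ex_ample"):
        print(candidate, is_ascii_label(candidate), is_idn_label(candidate))
```

Note that this accepts any Unicode letter and rejects combining marks, which is exactly the over-broad, registry-agnostic behaviour the question is trying to improve on.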

However, this is not ideal, since each IDN registry decides for itself which characters to allow and which to reject.

What I'm looking for is a proper, consistent, up-to-date data table of the Unicode characters allowed in the various TLDs; I'm getting the impression that I'll have to gather all the data myself from Russian and Chinese registry sites (which is quite difficult).

So before scouring the web, I wondered: is there such a list? Or are there better approaches, best or common practices, etc.? (I want the validation to be as strict as possible.)

Any help is welcome!

// Roland

© Stack Overflow or respective owner