Allowed Unicode characters in IDN host labels
- by Roland Franssen
Hi all,
I'm working on a "proper" URI validator, and at this point it all comes down to hostname validation; the rest isn't that tricky.
I'm stuck on IDN hostname labels (i.e. labels containing Unicode; any Punycode-encoded strings have already been decoded at this point).
My first idea was basically one regex for TLDs that do not support IDN and one for those that do (http://www.mozilla.org/projects/security/tld-idn-policy-list.html (?)).
Respectively:
^[a-zA-Z0-9-]+$ and ^[a-zA-Z0-9-\p{L}]+$
However, this is not ideal, since every IDN registry can decide for itself which characters to allow and which not.
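To make the two patterns concrete, here is a rough Python sketch (not my actual validator; the function names are made up for illustration). Python's built-in re module has no \p{L}, so the sketch assumes that checking unicodedata.category() for a category starting with "L" is a fair stand-in. Note that it rejects combining marks, which is one more reason a plain "any letter" class is too coarse for some scripts.

```python
import re
import unicodedata

# Pattern for TLDs without IDN support: ASCII letters, digits and hyphen only.
ASCII_LABEL = re.compile(r"[a-zA-Z0-9-]+")


def is_ascii_label(label: str) -> bool:
    """Rough equivalent of ^[a-zA-Z0-9-]+$ (hypothetical helper)."""
    return ASCII_LABEL.fullmatch(label) is not None


def is_idn_label(label: str) -> bool:
    """Rough equivalent of ^[a-zA-Z0-9-\\p{L}]+$ (hypothetical helper).

    Assumption: a character counts as \\p{L} when its Unicode general
    category starts with "L" (Lu, Ll, Lt, Lm, Lo).
    """
    if not label:
        return False
    return all(
        ch == "-"
        or ch in "0123456789"
        or unicodedata.category(ch).startswith("L")
        for ch in label
    )


if __name__ == "__main__":
    for candidate in ("example", "m\u00fcnchen", "ex_ample"):
        print(candidate, is_ascii_label(candidate), is_idn_label(candidate))
```

Neither check looks at label length or leading/trailing hyphens; it only mirrors the character classes of the two regexes.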
What I'm looking for is a proper, consistent, up-to-date data table of the Unicode characters allowed in the various TLDs. I get the impression I will have to collect all the data myself from Russian and Chinese registry sites (which is quite difficult).
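Roughly, this is how I imagine using such a table once I have one. This is a minimal Python sketch under assumptions: the "example" TLD and its character set are placeholder data I made up, not any real registry's policy, and the names LDH, EXTRA_ALLOWED and is_label_allowed are hypothetical.

```python
import string

# Characters every label may use: ASCII letters, digits and hyphen.
LDH = frozenset(string.ascii_letters + string.digits + "-")

# HYPOTHETICAL per-TLD additions. Placeholder data only, NOT a real
# registry policy; the real table is exactly what I am looking for.
EXTRA_ALLOWED = {
    "example": frozenset("\u00e4\u00f6\u00fc"),  # a-umlaut, o-umlaut, u-umlaut
}


def is_label_allowed(label: str, tld: str) -> bool:
    """Check a decoded label against the character set for its TLD.

    TLDs missing from the table are treated as ASCII-only, which is the
    strictest fallback.
    """
    allowed = LDH | EXTRA_ALLOWED.get(tld.lower(), frozenset())
    return bool(label) and all(ch in allowed for ch in label)


if __name__ == "__main__":
    print(is_label_allowed("m\u00fcnchen", "example"))
    print(is_label_allowed("m\u00fcnchen", "com"))
```

Falling back to ASCII-only for unknown TLDs keeps the validation strict by default; the open question remains where the per-TLD data should come from.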
So before scouring the web, I wondered: does such a list exist? Or are there better approaches, best or common practices, etc.? (I want the validation to be as strict as possible.)
Any help is welcome!
// Roland