Is there a list of language-only character regions for UTF-8 somewhere?
Posted
by Brehtt
on Stack Overflow
Published on 2010-05-17T03:15:36Z
I'm trying to analyze some UTF-8 encoded documents in a way that recognizes characters from different languages. For my approach to work I need to ignore non-language characters, such as control characters, mathematical symbols, etc. Just trying to dissect the Latin blocks of the Unicode standard has resulted in multiple regions, with characters like the division sign (U+00F7) sitting right in the middle of a range of valid Latin letters.
Is there a list somewhere that identifies these regions? Or better yet, a Regex that defines the regions or something in C# that can identify the different characters?
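To make the kind of check being asked about concrete, here is a minimal sketch. It assumes "language characters" means Unicode letters (general category L) plus combining marks (category M), and that the UTF-8 bytes have already been decoded into a .NET string. The class and method names (`LetterFilter`, `KeepLetters`, `IsLanguageChar`) are made up for illustration. It works on UTF-16 code units, so characters outside the Basic Multilingual Plane would need extra handling.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class LetterFilter
{
    // \p{L} matches any letter category (Lu, Ll, Lt, Lm, Lo);
    // \p{M} adds combining marks such as standalone accents.
    // The negated class matches runs of everything else.
    static readonly Regex NonLanguage = new Regex(@"[^\p{L}\p{M}]+");

    // Replace each run of non-letter characters with a single space.
    public static string KeepLetters(string text) =>
        NonLanguage.Replace(text, " ").Trim();

    // The same test for a single UTF-16 code unit, without a regex.
    public static bool IsLanguageChar(char c)
    {
        if (char.IsLetter(c)) return true;
        UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
        return cat == UnicodeCategory.NonSpacingMark
            || cat == UnicodeCategory.SpacingCombiningMark
            || cat == UnicodeCategory.EnclosingMark;
    }

    static void Main()
    {
        // Digits, punctuation and math symbols are dropped; letters stay.
        Console.WriteLine(KeepLetters("naïve ÷ 2 = Привет!"));

        // The division sign is a math symbol (category Sm), not a letter.
        Console.WriteLine(IsLanguageChar('÷'));

        // Named blocks such as IsCyrillic test for a specific code point range.
        Console.WriteLine(Regex.IsMatch("Привет", @"^\p{IsCyrillic}+$"));
    }
}
```

Testing by general category rather than by hand-built code point ranges sidesteps the problem of symbols like the division sign sitting inside an otherwise alphabetic block. Named blocks (`\p{IsCyrillic}`, `\p{IsGreek}`, `\p{IsBasicLatin}`) can then be layered on top to tell scripts apart, keeping in mind that a block is a code point range and may itself contain non-letters.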
© Stack Overflow or respective owner