Cross-platform iteration of Unicode string
Posted by kizzx2 on Stack Overflow
Published on 2011-01-02T16:11:44Z
I want to iterate each character of a Unicode string, treating each surrogate pair and combining character sequence as a single unit (one grapheme).
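To make the first half of that requirement concrete, here is a minimal sketch that counts code points in a UTF-16 string by treating each well-formed surrogate pair as one unit. The function name `count_code_points` is hypothetical and not taken from any library. It does not merge combining marks, so it covers only part of what is being asked for.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (illustration only): counts Unicode code points in a
// UTF-16 string. A high surrogate followed by a low surrogate is one code
// point; unpaired surrogates are counted individually.
// Combining marks are NOT merged here.
std::size_t count_code_points(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const char16_t c = s[i];
        const bool is_high = c >= 0xD800 && c <= 0xDBFF;
        const bool next_is_low = i + 1 < s.size()
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (is_high && next_is_low)
            ++i; // skip the low surrogate: the pair forms one code point
        ++count;
    }
    return count;
}
```

For instance, `count_code_points(u"a\U0001F600b")` sees four UTF-16 code units but reports three code points.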
Example
The text "नमस्ते" consists of the code points U+0928, U+092E, U+0938, U+094D, U+0924, U+0947, of which U+094D and U+0947 are combining marks.
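The counting I have in mind can be sketched as "a base code point plus any combining marks that follow it form one element". The sketch below is only an illustration under a loud assumption: the helper `is_example_combining_mark` is hypothetical and recognizes just the two marks in this example (U+094D and U+0947). A real solution would need the Unicode character data rather than a hard-coded list.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true only for the two combining marks that occur in
// this example (U+094D DEVANAGARI SIGN VIRAMA, U+0947 DEVANAGARI VOWEL
// SIGN E). This is an assumption for illustration, not a general test.
bool is_example_combining_mark(char32_t c)
{
    return c == 0x094D || c == 0x0947;
}

// Counts text elements in a UTF-32 string, where an element is a base code
// point together with the combining marks that directly follow it.
std::size_t count_text_elements(const std::u32string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        // A new element starts at every non-mark, and at a mark that has
        // no base code point before it (start of the string).
        if (i == 0 || !is_example_combining_mark(s[i]))
            ++count;
    }
    return count;
}
```

Called with `U"\u0928\u092E\u0938\u094D\u0924\u0947"`, the string holds 6 code points and `count_text_elements` returns 4.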
static void Main(string[] args)
{
    const string s = "नमस्ते";
    Console.WriteLine(s.Length); // Outputs "6"
    var l = 0;
    var e = System.Globalization.StringInfo.GetTextElementEnumerator(s);
    while(e.MoveNext()) l++;
    Console.WriteLine(l); // Outputs "4"
}
So there we have it in .NET. We also have Win32's CharNextW():
#include <Windows.h>
#include <iostream>
#include <string>

int main()
{
    const wchar_t * s = L"नमस्ते";
    std::cout << std::wstring(s).length() << std::endl; // Gives "6"
    int l = 0;
    while(CharNextW(s) != s)
    {
        s = CharNextW(s);
        ++l;
    }
    std::cout << l << std::endl; // Gives "4"
    return 0;
}
Question
Both ways I know of are specific to Microsoft. Are there portable ways to do it?
- I heard about ICU, but I couldn't quickly find anything related (UnicodeString(s).length() still gives 6). Pointing to the relevant function/module in ICU would be an acceptable answer.
- C++ doesn't have a notion of Unicode, so a lightweight cross-platform library for dealing with these issues would also make an acceptable answer.
© Stack Overflow or respective owner