Java UTF-8 to ASCII conversion with supplements
Posted by bozo on Stack Overflow
Published on 2010-03-30T12:43:19Z
Indexed on 2010/03/30 13:13 UTC
Hi,
We accept all sorts of national characters in UTF-8 strings on input, and we need to convert them to ASCII strings on output for a legacy system. (We don't accept Chinese or Japanese characters, only European languages.)
We have a small utility to get rid of all the diacritics:
import java.text.Normalizer;

public static final String toBaseCharacters(final String sText) {
    if (sText == null || sText.length() == 0)
        return sText;
    final char[] chars = sText.toCharArray();
    final int iSize = chars.length;
    final StringBuilder sb = new StringBuilder(iSize);
    for (int i = 0; i < iSize; i++) {
        String sLetter = new String(new char[] { chars[i] });
        // NFD (not NFC) splits an accented letter into its base letter
        // followed by combining marks, e.g. "é" becomes "e" + U+0301.
        sLetter = Normalizer.normalize(sLetter, Normalizer.Form.NFD);
        // Keep only the first char, which is the base letter. Characters
        // with no decomposition (ß, Ð, đ) pass through unchanged.
        sb.append(sLetter.charAt(0));
    }
    return sb.toString();
}
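As a rough illustration of what decomposition can and cannot do, here is a minimal sketch, assuming Java 6 or later for `java.text.Normalizer`. The class name `DecompositionDemo` and the helper `stripMarks` are made up for this example and are not part of the utility above. It normalizes to NFD and then drops combining marks with a regular expression.

```java
import java.text.Normalizer;

public class DecompositionDemo {
    // Hypothetical helper: decompose to NFD, then remove every combining
    // mark (Unicode general category M) left behind by the decomposition.
    public static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // Unicode escapes avoid any dependence on the source file encoding:
        // é, č, ü, ß, Đ
        String[] samples = { "\u00e9", "\u010d", "\u00fc", "\u00df", "\u0110" };
        for (String s : samples) {
            System.out.println(s + " -> " + stripMarks(s));
        }
    }
}
```

The first three samples come out as plain e, c and u. The last two (ß and Đ) have no canonical decomposition in Unicode, so normalization alone leaves them exactly as they were.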
The question is how to replace the German sharp s (ß), Ð, đ, and other characters that get through the above normalization method with their substitutes (for ß, the substitute would probably be "ss"; for Ð, it would be either "D" or "Dj").
Is there a simple way to do this without a million .replaceAll() calls?
So for example: Ðonardan = Djonardan, Blaß = Blass and so on.
We could replace all "problematic" characters with a space, but we would like to avoid that so the output stays as similar to the input as possible.
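One possible shape for this, sketched under assumptions rather than offered as a definitive answer: a single pass over the decomposed string with a lookup table for the characters that normalization leaves alone. The class name `AsciiFolder`, the method `toAscii`, and every entry in the table are hypothetical; the table would need to be extended for the languages actually accepted, and the choice of "Dj" over "D" is just one of the two options mentioned above.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

public class AsciiFolder {
    // Hypothetical replacement table for characters without a decomposition.
    private static final Map<Character, String> REPLACEMENTS =
            new HashMap<Character, String>();
    static {
        REPLACEMENTS.put('\u00df', "ss"); // ß
        REPLACEMENTS.put('\u00d0', "Dj"); // Ð (eth)
        REPLACEMENTS.put('\u0110', "Dj"); // Đ (D with stroke)
        REPLACEMENTS.put('\u00f0', "dj"); // ð
        REPLACEMENTS.put('\u0111', "dj"); // đ
        REPLACEMENTS.put('\u00c6', "AE"); // Æ
        REPLACEMENTS.put('\u00e6', "ae"); // æ
    }

    // True for the combining marks that NFD splits off from base letters.
    private static boolean isCombiningMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    public static String toAscii(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (c < 128) {
                sb.append(c);               // already ASCII
                continue;
            }
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                sb.append(replacement);     // table hit, e.g. ß becomes "ss"
            } else if (!isCombiningMark(c)) {
                sb.append(' ');             // unknown character: fall back to a space
            }
            // combining marks are dropped silently
        }
        return sb.toString();
    }
}
```

With this table, `AsciiFolder.toAscii("\u00d0onardan")` yields "Djonardan" and `AsciiFolder.toAscii("Bla\u00df")` yields "Blass". The space fallback only applies to characters that are neither ASCII, nor in the table, nor combining marks, so the table can grow as new cases turn up.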
Thank you for your answers,
Bozo