Java parsing UTF8
Posted by Jack on Stack Overflow
Published on 2010-04-06T16:37:18Z
I have an issue with a UTF-8 file structured as follows:
FIELD1§FIELD2§FIELD3§FIELD4
Looking at the hexadecimal values of the file, it uses the single byte A7 to encode §. The file is supposed to be UTF-8, but that seems strange: A7 > 7F, so one byte shouldn't be enough to encode § in UTF-8.
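The byte-count reasoning above can be checked with a small sketch. It only prints how Java encodes § (U+00A7) in UTF-8 and in ISO-8859-1. The class name `SectionSignBytes` and the helper `hex` are made up for illustration, and `StandardCharsets` assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

public class SectionSignBytes {
    // Formats a byte array as space-separated uppercase hex, e.g. "C2 A7".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The section sign, written as an escape so the result does not
        // depend on the encoding used to compile this source file.
        String section = "\u00A7";
        System.out.println("UTF-8:      " + hex(section.getBytes(StandardCharsets.UTF_8)));      // C2 A7
        System.out.println("ISO-8859-1: " + hex(section.getBytes(StandardCharsets.ISO_8859_1))); // A7
    }
}
```

A lone A7 byte matches the single-byte form, while UTF-8 needs the two bytes C2 A7.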
So I tried using a BufferedReader directly with a specified charset:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(input), "UTF-8"));
but when I try to tokenize the string with
SmartTokenizer st = new SmartTokenizer(toTokenize, "§");
(the SmartTokenizer is a modified version of the StringTokenizer that keeps empty tokens)
no splitting occurs, and if I try to print the string I obtain
FIELD1?FIELD2?FIELD3?...
so the § used in the file is different from the one specified as the delimiter, and it can't be printed correctly either.
So what's the problem here? Maybe the original file should use 2 bytes to store §?
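One way to narrow this down is to decode the same raw bytes with two charsets and compare the results. The sketch below assumes the file is really in a single-byte encoding such as ISO-8859-1 (windows-1252 also maps A7 to §); that is a guess based on the hex dump, not something confirmed here. It uses `String.split` with a limit of -1 in place of SmartTokenizer, which is not available, and the class name `DecodeSectionSign` and the sample bytes are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeSectionSign {
    // Decodes the raw bytes with the given charset, then splits on the
    // section sign. A limit of -1 makes String.split keep empty fields.
    static String[] decodeAndSplit(byte[] raw, Charset charset) {
        String line = new String(raw, charset);
        return line.split("\u00A7", -1);
    }

    public static void main(String[] args) {
        // Hypothetical sample line: "A" A7 "B" A7 A7 "D" (one empty field).
        byte[] raw = {'A', (byte) 0xA7, 'B', (byte) 0xA7, (byte) 0xA7, 'D'};

        // As UTF-8, a lone A7 is malformed input and is replaced with U+FFFD,
        // so the delimiter never appears in the decoded string.
        System.out.println(decodeAndSplit(raw, StandardCharsets.UTF_8).length);      // 1

        // As ISO-8859-1, A7 decodes to the section sign and the split works.
        System.out.println(decodeAndSplit(raw, StandardCharsets.ISO_8859_1).length); // 4
    }
}
```

If the real file behaves like the second case, passing that charset name to the InputStreamReader instead of UTF-8 would be the thing to try. The replacement character U+FFFD is also commonly shown as ? on consoles that cannot display it, which would be consistent with the printed output.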
© Stack Overflow or respective owner