Java parsing UTF8

Posted by Jack on Stack Overflow
Published on 2010-04-06T16:37:18Z

Filed under: java | utf8 | charset

I have the following issue with a UTF-8 file structured as follows:

FIELD1§FIELD2§FIELD3§FIELD4

Looking at the hexadecimal values in the file, it uses the single byte A7 to encode §. The file is supposed to be UTF-8, but that is strange: A7 > 7F, so one byte shouldn't be enough to encode § in UTF-8.
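The one-byte versus two-byte point can be checked directly. The following is a minimal sketch, not part of the original program: the class name SectionSignBytes and the helper hex are made up for illustration, and ISO-8859-1 is only assumed as a candidate single-byte charset (windows-1252 also maps § to A7). It needs Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

class SectionSignBytes {
    // Formats bytes as uppercase hex pairs separated by spaces (illustrative helper).
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The section sign written as an escape, so the source file encoding does not matter.
        String sign = "\u00A7";
        System.out.println("UTF-8:      " + hex(sign.getBytes(StandardCharsets.UTF_8)));      // C2 A7
        System.out.println("ISO-8859-1: " + hex(sign.getBytes(StandardCharsets.ISO_8859_1))); // A7
    }
}
```

In UTF-8, § (U+00A7) takes the two bytes C2 A7, so a lone A7 in the file suggests a single-byte encoding rather than UTF-8.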

So I tried using directly a BufferedReader with a specified charset:

BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(input), "UTF-8"));

but when I try to tokenize the string with

SmartTokenizer st = new SmartTokenizer(toTokenize, "§");

(the SmartTokenizer is a modified version of the StringTokenizer that keeps empty tokens)

no splitting occurs, and if I try to print the string I obtain

FIELD1?FIELD2?FIELD3?...

so the § used in the file is different from the one specified as the delimiter, and it can't be printed either.

So what's the problem here? Maybe the original file should use 2 bytes to store §?
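The symptom can be reproduced in isolation. The sketch below is only an illustration under assumptions: the bytes stand in for the real file and are assumed to be ISO-8859-1, the class name DecodeCheck and the helper decode are hypothetical, and String.split with a limit of -1 stands in for SmartTokenizer because it also keeps empty tokens. It needs Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeCheck {
    // Decodes raw bytes with the given charset; malformed input is replaced with U+FFFD.
    static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        // Stand-in for the file content: fields separated by the single byte A7.
        byte[] data = "FIELD1\u00A7FIELD2\u00A7FIELD3\u00A7FIELD4"
                .getBytes(StandardCharsets.ISO_8859_1);

        String asUtf8 = decode(data, StandardCharsets.UTF_8);
        String asLatin1 = decode(data, StandardCharsets.ISO_8859_1);

        // A lone A7 is not valid UTF-8, so each one becomes the replacement character.
        System.out.println(asUtf8.indexOf('\uFFFD'));            // 6
        // The limit -1 keeps empty tokens, including trailing ones.
        System.out.println(asUtf8.split("\u00A7", -1).length);   // 1
        System.out.println(asLatin1.split("\u00A7", -1).length); // 4
    }
}
```

Decoded as UTF-8, every A7 byte turns into U+FFFD, which a console that cannot display it usually prints as ?, so the delimiter never matches. If the file really is in a single-byte charset, passing that charset to InputStreamReader instead of UTF-8 should make the split work.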

© Stack Overflow or respective owner
