How do you get Matlab to write the BOM (byte order mark) for UTF-16 text files?
- by Richard Povinelli
I am creating UTF-16 text files with Matlab and later reading them in Java. In Matlab, I open a file named fileName and write to it as follows:
fid = fopen(fileName, 'w','n','UTF16-LE');
fprintf(fid,"Some stuff.");
In Java, I can read the text file using the following code:
FileInputStream fileInputStream = new FileInputStream(fileName);
Scanner scanner = new Scanner(fileInputStream, "UTF-16LE");
String s = scanner.nextLine();
Here is the hex output:
Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13
00000000 73 00 6F 00 6D 00 65 00 20 00 73 00 74 00 75 00 66 00 66 00 s.o.m.e. .s.t.u.f.f.
The above approach works fine. However, I want to write the file as UTF-16 with a BOM, so that the reader can detect the byte order and I don't have to worry about big versus little endian. In Matlab, I've coded:
fid = fopen(fileName, 'w','n','UTF16');
fprintf(fid,"Some stuff.");
In Java, I change the code to:
FileInputStream fileInputStream = new FileInputStream(fileName);
Scanner scanner = new Scanner(fileInputStream, "UTF-16");
String s = scanner.nextLine();
In this case, the string s is garbled because Matlab does not write the BOM. The Java code works fine if I add the BOM manually; with the BOM added, the following file is read correctly:
Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13 14 15
00000000 FF FE 73 00 6F 00 6D 00 65 00 20 00 73 00 74 00 75 00 66 00 66 00 ÿþs.o.m.e. .s.t.u.f.f.
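The garbling can be reproduced without any file at all. The following is a minimal Java sketch, assuming the standard JDK behaviour in which the "UTF-16" charset honours a leading BOM when decoding and falls back to big endian when no BOM is present. The class name BomDemo and the helper withLittleEndianBom are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // Hypothetical helper: prepend the little-endian BOM (FF FE) to bytes
    // that are already encoded as UTF-16LE.
    static byte[] withLittleEndianBom(byte[] payload) {
        byte[] out = new byte[payload.length + 2];
        out[0] = (byte) 0xFF;
        out[1] = (byte) 0xFE;
        System.arraycopy(payload, 0, out, 2, payload.length);
        return out;
    }

    public static void main(String[] args) {
        String text = "some stuff";

        // Same bytes as the first hex dump: little endian, no BOM.
        byte[] noBom = text.getBytes(StandardCharsets.UTF_16LE);
        // Same bytes as the second hex dump: FF FE followed by the payload.
        byte[] bom = withLittleEndianBom(noBom);

        // Without a BOM the decoder assumes big endian, so 73 00 becomes U+7300.
        String garbled = new String(noBom, StandardCharsets.UTF_16);
        // With the BOM the decoder switches to little endian and drops the BOM.
        String ok = new String(bom, StandardCharsets.UTF_16);

        System.out.println(text.equals(garbled)); // false
        System.out.println(text.equals(ok));      // true
    }
}
```

This suggests the Java side is behaving as documented, and that the missing two bytes at the start of the file are the only difference between the garbled and the correct read.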
How can I get Matlab to write out the BOM? I know I could write the BOM out separately, but I'd rather have Matlab do it automatically.
Addendum
I selected the answer below from Amro because it exactly solves the question I posed.
One key discovery for me was the difference between the Unicode Standard and a UTF (Unicode transformation format) (see http://unicode.org/faq/utf_bom.html). The Unicode Standard provides unique identifiers (code points) for characters. UTFs provide mappings of every code point "to a unique byte sequence." Since all but a handful of the characters I am using are in the first 128 code points, I'm going to switch to using UTF-8 as Romeo suggests. UTF-8 is supported by both Matlab and Java (so the warning shown below won't need to be suppressed), and for my application it will generate smaller text files.
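To make the size argument concrete, here is a small Java sketch that encodes the same string under both UTFs and compares the byte counts. The class name EncodingSizes and its two helper methods are illustrative only; UTF-16LE is used for the comparison so that no BOM is counted.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies when encoded as UTF-16LE (no BOM).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String ascii = "Some stuff.";   // all code points below 128
        String accented = "\u00e9";     // U+00E9, outside the first 128

        System.out.println(utf8Length(ascii));     // 11
        System.out.println(utf16Length(ascii));    // 22
        System.out.println(utf8Length(accented));  // 2
        System.out.println(utf16Length(accented)); // 2
    }
}
```

For text that is almost entirely in the first 128 code points, UTF-8 therefore takes roughly half the space of UTF-16, and byte order never comes into play because UTF-8 is defined as a sequence of single bytes.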
I suppress the Matlab warning
Warning: The encoding 'UTF-16LE' is not supported.
with
warning off MATLAB:iofun:UnsupportedEncoding;