I ran into a problem involving UTF-8, XML and Perl. Below is the smallest
piece of code and data I could find that reproduces it.
Here is the XML file that needs to be parsed:
<?xml version="1.0" encoding="utf-8"?>
<test>
<words>???????????? ??????? ????????? ???? ???????????? ??????</words>
<words>???????????? ??????? ????????? ???? ???????????? ??????</words>
<words>???????????? ??????? ????????? ???? ???????????? ??????</words>
[<words> .... </words> 148 times repeated]
<words>???????????? ??????? ????????? ???? ???????????? ??????</words>
<words>???????????? ??????? ????????? ???? ???????????? ??????</words>
</test>
The parsing is done with this Perl script:
use warnings;
use strict;
use XML::Parser;
use Data::Dump;

my $in_words = 0;
my $xml_parser = XML::Parser->new(Style => 'Stream');
$xml_parser->setHandlers(
    Start   => \&start_element,
    End     => \&end_element,
    Char    => \&character_data,
    Default => \&default,
);

open OUT, '>out.txt';
binmode(OUT, ":utf8");
open XML, 'xml_test.xml' or die;
$xml_parser->parse(*XML);
close XML;
close OUT;

sub start_element {
    my ($parseinst, $element, %attributes) = @_;
    if ($element eq 'words') {
        $in_words = 1;
    }
    else {
        $in_words = 0;
    }
}

sub end_element {
    my ($parseinst, $element, %attributes) = @_;
    if ($element eq 'words') {
        $in_words = 0;
    }
}

sub default {
    # nothing to see here;
}

sub character_data {
    my ($parseinst, $data) = @_;
    if ($in_words) {
        print OUT "$data\n";
    }
}
Running the script produces the file out.txt. The problem is on line 147 of this
file: the 22nd character (which in UTF-8 is the two bytes \xd6 \xb8) is split
between the d6 and the b8 by a newline. This should not happen.
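In case it helps with reproducing this, here is a rough sketch of how the broken lines could be located without opening out.txt in an editor. It reads the file as raw bytes and reports every line that does not decode as strict UTF-8. The helper name invalid_utf8_lines is made up for this post, and the check is only a diagnostic: it says where the output is malformed, not why.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the 1-based numbers of the lines in $path
# that are not valid UTF-8 when read as raw bytes.
sub invalid_utf8_lines {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my @bad;
    while (my $line = <$fh>) {
        # decode() may modify its argument when a CHECK value is given,
        # so work on a copy of the line.
        my $copy = $line;
        my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
        push @bad, $. unless $ok;
    }
    close $fh;
    return @bad;
}

# When run directly, report the offending lines of out.txt (if it exists).
if (!caller && -e 'out.txt') {
    print "invalid UTF-8 on line $_\n" for invalid_utf8_lines('out.txt');
}
```

If a two-byte character really is cut in half by the newline, both the line ending in the lead byte and the line starting with the continuation byte should show up in the report.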
Can anyone else reproduce this problem, or has anyone seen it before?
And why does it happen?
I am running this script on Windows:
C:\temp>perl -v
This is perl, v5.10.0 built for MSWin32-x86-multi-thread
(with 5 registered patches, see perl -V for more detail)
Copyright 1987-2007, Larry Wall
Binary build 1003 [285500] provided by ActiveState http://www.ActiveState.com
Built May 13 2008 16:52:49