I have run into a problem involving UTF-8, XML and Perl. Below is the smallest
piece of code and data I could come up with that reproduces it.
Here's the XML file that needs to be parsed:
<?xml version="1.0" encoding="utf-8"?>
<test>
  <words>???????????? ??????? ????????? ???? ???????????? ??????</words>
  <words>???????????? ??????? ????????? ???? ???????????? ??????</words>
  <words>???????????? ??????? ????????? ???? ???????????? ??????</words>
  [<words> .... </words> 148 times repeated]
  <words>???????????? ??????? ????????? ???? ???????????? ??????</words>
  <words>???????????? ??????? ????????? ???? ???????????? ??????</words>
</test>
The parsing is done with this Perl script:
use warnings;
use strict;
use XML::Parser;
use Data::Dump;
my $in_words = 0;
my $xml_parser = XML::Parser->new(Style => 'Stream');
$xml_parser->setHandlers (
   Start   => \&start_element,
   End     => \&end_element,
   Char    => \&character_data,
   Default => \&default);
open OUT, '>', 'out.txt' or die "Cannot write out.txt: $!";
binmode(OUT, ":utf8");
open XML, '<', 'xml_test.xml' or die "Cannot read xml_test.xml: $!";
$xml_parser->parse(*XML);
close XML;
close OUT;
sub start_element {
  my($parseinst, $element, %attributes) = @_;
  if ($element eq 'words') {
    $in_words = 1;
  }
  else {
    $in_words = 0;
  }
}
sub end_element {
  my($parseinst, $element) = @_;  # the End handler receives no attributes
  if ($element eq 'words') {
    $in_words = 0;
  }
}
sub default {
  # nothing to see here;
}
sub character_data {
  my($parseinst, $data) = @_;
  if ($in_words) {
    print OUT "$data\n";
  }
}
When the script is run, it produces the file out.txt. The problem is on line 147
of this file: the 22nd character (which in UTF-8 consists of the bytes \xd6 \xb8) is
split between the d6 and the b8 by a newline. This should not happen.
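To pin down exactly which lines of the output are affected, a byte-level check along these lines might help. It is only a sketch: the helper name invalid_utf8_lines is made up for illustration, and it assumes that the core utf8::decode function returns false for input that is not well-formed UTF-8. The file is read with the :raw layer so that no decoding or newline translation hides the stray bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 1-based numbers of the lines in $path
# whose raw bytes are not well-formed UTF-8.
sub invalid_utf8_lines {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my @bad;
    while (my $line = <$fh>) {
        my $copy = $line;                        # utf8::decode works in place
        push @bad, $. unless utf8::decode($copy);
    }
    close $fh;
    return @bad;
}

# Check every file named on the command line, e.g.: perl check.pl out.txt
for my $file (@ARGV) {
    print "$file: line $_ is not valid UTF-8\n" for invalid_utf8_lines($file);
}
```

If the character really is split as described, both the line ending in d6 and the following line starting with b8 should be reported.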
Can anyone else reproduce this, or has anyone seen it before?
And why does it happen?
I am running this script on Windows:
C:\temp>perl -v
This is perl, v5.10.0 built for MSWin32-x86-multi-thread
(with 5 registered patches, see perl -V for more detail)
Copyright 1987-2007, Larry Wall
Binary build 1003 [285500] provided by ActiveState http://www.ActiveState.com
Built May 13 2008 16:52:49