utf8 problem with Perl and XML::Parser

Posted by René Nyffenegger on Stack Overflow
Published on 2010-03-24T21:36:51Z

I encountered a problem dealing with UTF-8, XML and Perl. The following is the smallest piece of code and data I could find that reproduces the problem.

Here's an XML file that needs to be parsed:

<?xml version="1.0" encoding="utf-8"?>
<test>
  <words>???????????? ??????? ????????? ???? ???????????? ??????</words>
  <words>???????????? ??????? ????????? ???? ???????????? ??????</words>
  <words>???????????? ??????? ????????? ???? ???????????? ??????</words>

  [<words> .... </words> 148 times repeated]

  <words>???????????? ??????? ????????? ???? ???????????? ??????</words>
  <words>???????????? ??????? ????????? ???? ???????????? ??????</words>
</test>

The parsing is done with this Perl script:

use warnings;
use strict;

use XML::Parser;
use Data::Dump;

my $in_words = 0;

my $xml_parser = XML::Parser->new(Style => 'Stream');

$xml_parser->setHandlers (
   Start   => \&start_element,
   End     => \&end_element,
   Char    => \&character_data,
   Default => \&default);

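# Fail loudly if either file cannot be opened.
open OUT, '>', 'out.txt' or die "Cannot open out.txt: $!";
binmode(OUT, ':utf8');
open XML, '<', 'xml_test.xml' or die "Cannot open xml_test.xml: $!";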
$xml_parser->parse(*XML);
close XML;
close OUT;


sub start_element {
  my($parseinst, $element, %attributes) = @_;

  if ($element eq 'words') {
    $in_words = 1;
  }
  else {
    $in_words = 0;
  }
}

sub end_element {
  my($parseinst, $element) = @_;   # End handlers receive no attributes

  if ($element eq 'words') {
    $in_words = 0;
  }
}

sub default {
  # nothing to see here;
}

sub character_data {
  my($parseinst, $data) = @_;

  if ($in_words) {
    print OUT "$data\n";
  }
}
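
One thing worth ruling out: the XML::Parser documentation says that a single run of character data may generate multiple calls to the Char handler, so printing a newline after every call can break one <words> text into several lines. The following is only a sketch of the usual workaround, which collects the pieces and emits them in the End handler. The names $buffer and @texts and the tiny inline document are made up for illustration, and I am not claiming this explains why the break falls inside a multi-byte sequence.

```perl
use strict;
use warnings;
use XML::Parser;

my $buffer;   # hypothetical: undef outside <words>, accumulated text inside
my @texts;    # hypothetical: one entry per complete <words> element

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $element) = @_;
            $buffer = '' if $element eq 'words';   # start collecting
        },
        Char => sub {
            my ($expat, $data) = @_;
            # May be called several times for one text run: append, don't print.
            $buffer .= $data if defined $buffer;
        },
        End => sub {
            my ($expat, $element) = @_;
            if ($element eq 'words' && defined $buffer) {
                push @texts, $buffer;              # the text is complete here
                undef $buffer;
            }
        },
    },
);

# Stand-in for xml_test.xml: the bytes \xd6\xb8 are the UTF-8 sequence from the question.
my $xml = qq{<?xml version="1.0" encoding="utf-8"?>\n}
        . qq{<test><words>a\xd6\xb8b</words><words>second</words></test>};

$parser->parse($xml);

printf "%d <words> elements collected\n", scalar @texts;
```

With this shape, the output no longer depends on where the parser happens to cut the text, because each element is written only once it is complete.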

When the script is run, it produces the file out.txt. The problem is on line 147 of this file: the 22nd character (which in UTF-8 consists of the two bytes \xd6 \xb8) is split by a newline between the d6 and the b8. This should not happen.
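
For anyone trying to reproduce this, a small checker can locate such a break without opening the file in a hex editor. This is a rough sketch using only the core Encode module; the function name invalid_utf8_lines and the in-memory sample data are invented for the example. For the real file, out.txt would be opened with the ':raw' layer instead.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Return the 1-based numbers of the lines that are not valid UTF-8.
sub invalid_utf8_lines {
    my ($fh) = @_;
    my @bad;
    while (my $line = <$fh>) {
        my $copy = $line;   # decode() with a CHECK value may modify its argument
        push @bad, $. unless eval { decode('UTF-8', $copy, FB_CROAK); 1 };
    }
    return @bad;
}

# Made-up sample: line 1 is intact, lines 2 and 3 carry the two halves
# of a \xd6\xb8 sequence that was broken by a newline.
my $sample = "ab\xd6\xb8c\n" . "ab\xd6\n" . "\xb8c\n";

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @bad = invalid_utf8_lines($fh);
close $fh;

print "invalid lines: @bad\n";
```

A line ending in a lone d6 and a following line starting with a lone b8 are both reported, which matches the symptom described above.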

I would like to know whether anyone else has seen this problem or can reproduce it, and why it happens. I am running this script on Windows:

C:\temp>perl -v

This is perl, v5.10.0 built for MSWin32-x86-multi-thread
(with 5 registered patches, see perl -V for more detail)

Copyright 1987-2007, Larry Wall

Binary build 1003 [285500] provided by ActiveState http://www.ActiveState.com
Built May 13 2008 16:52:49

© Stack Overflow or respective owner