Forcing a mixed ISO-8859-1 and UTF-8 multi-line string into UTF-8 in Perl

Posted by knorv on Stack Overflow
Published on 2010-03-31T17:46:27Z

Consider the following problem:

A multi-line string $junk contains some lines which are encoded in UTF-8 and some in ISO-8859-1. I don't know a priori which lines are in which encoding, so heuristics will be needed.

I want to turn $junk into pure UTF-8, with the ISO-8859-1 lines properly re-encoded. If processing hits an error, I want a best-effort result rather than a thrown error.

My current attempt looks like this:

$junk = &force_utf8($junk);

sub force_utf8 {
  my $input = shift;
  my $output = '';
  foreach my $line (split(/\n/, $input)) {
    if (utf8::valid($line)) {
      utf8::decode($line);
    }
    $output .= "$line\n";
  }
  return $output;
}

While this appears to work, I'm certain it is not the optimal solution.

How would you improve the force_utf8(...) sub?
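
One possible direction, offered as a sketch rather than a definitive answer: try a strict UTF-8 decode on each line and fall back to ISO-8859-1 when that fails. As far as I can tell, utf8::valid only checks that a scalar is internally consistent, so the real work in the attempt above is done by utf8::decode, which leaves a line unchanged when it is not valid UTF-8. The name force_utf8_bytes is hypothetical, and the sketch assumes $input is a byte string (not yet decoded) and that the caller wants UTF-8 encoded bytes back.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical variant of force_utf8: takes raw bytes, returns UTF-8 bytes.
sub force_utf8_bytes {
    my ($input) = @_;
    my @out;
    # A limit of -1 keeps trailing empty fields, so blank lines survive.
    for my $line (split /\n/, $input, -1) {
        # Strict decode: dies on malformed UTF-8; LEAVE_SRC keeps $line intact.
        my $chars = eval { decode('UTF-8', $line, FB_CROAK | LEAVE_SRC) };
        # Fallback assumption: anything that is not valid UTF-8 is ISO-8859-1,
        # where every byte maps to a character, so this decode cannot fail.
        $chars = decode('ISO-8859-1', $line) unless defined $chars;
        push @out, encode('UTF-8', $chars);
    }
    return join "\n", @out;
}

# Example: the first line is already UTF-8, the second is ISO-8859-1.
my $junk  = "caf\xC3\xA9\ncaf\xE9";
my $clean = force_utf8_bytes($junk);
```

The heuristic has a known blind spot: an ISO-8859-1 line whose bytes happen to form valid UTF-8 (for example the two characters "Ã©", bytes C3 A9) is taken as UTF-8. Pure ASCII lines are identical in both encodings, so they pass through unchanged either way.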
