Forcing a mixed ISO-8859-1 and UTF-8 multi-line string into UTF-8 in Perl
- by knorv
Consider the following problem:
A multi-line string $junk contains some lines which are encoded in UTF-8 and some in ISO-8859-1. I don't know a priori which lines are in which encoding, so heuristics will be needed.
I want to turn $junk into pure UTF-8, with the ISO-8859-1 lines properly re-encoded. Also, if processing fails, I want a "best effort" result rather than a thrown error.
My current attempt looks like this:
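$junk = force_utf8($junk);

sub force_utf8 {
    my $input  = shift;
    my $output = '';
    foreach my $line (split(/\n/, $input)) {
        # utf8::decode() works in place and leaves $line untouched when it
        # is not valid UTF-8, so those lines keep their ISO-8859-1 byte
        # values. (utf8::valid() is always true for a byte string, so it
        # cannot be used to pick out the UTF-8 lines.)
        utf8::decode($line);
        $output .= "$line\n";
    }
    return $output;    # decoded characters; encode as UTF-8 on output
}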
While this appears to work, I'm certain it is not the optimal solution.
How would you improve the force_utf8(...) sub?
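One direction I have been considering, as a rough sketch rather than a settled answer: decode each line strictly as UTF-8 with the Encode module, and fall back to ISO-8859-1 only when that fails. The name force_utf8_strict is made up for illustration, and I am assuming the input is a plain byte string and that UTF-8 encoded bytes are what should come back.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: takes raw bytes, returns UTF-8 encoded bytes.
sub force_utf8_strict {
    my ($bytes) = @_;
    my @out;
    # A limit of -1 keeps trailing empty fields, so a final newline survives.
    for my $line (split /\n/, $bytes, -1) {
        # Strict attempt: FB_CROAK makes decode() die on malformed UTF-8,
        # LEAVE_SRC keeps $line intact for the fallback below.
        my $chars = eval { decode('UTF-8', $line, FB_CROAK | LEAVE_SRC) };
        # Fallback: any byte sequence is valid ISO-8859-1, so this cannot fail.
        $chars = decode('ISO-8859-1', $line) unless defined $chars;
        push @out, $chars;
    }
    return encode('UTF-8', join("\n", @out));
}
```

The heuristic is still only "valid UTF-8 wins": an ISO-8859-1 line whose bytes happen to form valid UTF-8 would be misread, and I don't see how to detect that case from the bytes alone.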