Forcing a mixed ISO-8859-1 and UTF-8 multi-line string into UTF-8
Posted by knorv on Stack Overflow
Published on 2010-03-31T17:46:27Z
Consider the following problem:
A multi-line string $junk contains some lines which are encoded in UTF-8 and some in ISO-8859-1. I don't know a priori which lines are in which encoding, so heuristics will be needed.
I want to turn $junk into pure UTF-8, with the ISO-8859-1 lines properly re-encoded. Also, if errors occur during processing, I want a "best effort" result rather than a thrown error.
My current attempt looks like this:
    $junk = &force_utf8($junk);

    sub force_utf8 {
        my $input = shift;
        my $output = '';
        foreach my $line (split(/\n/, $input)) {
            if (utf8::valid($line)) {
                utf8::decode($line);
            }
            $output .= "$line\n";
        }
        return $output;
    }
While this appears to work, I'm certain it is not the optimal solution.
How would you improve my force_utf8(...) sub?
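For reference, here is a minimal sketch of one direction I could imagine, not a definitive answer. It assumes $junk is a raw byte string (not already decoded) and that "pure UTF-8" means UTF-8 encoded bytes on output. The heuristic is: try a strict UTF-8 decode of each line with the core Encode module, and fall back to ISO-8859-1 if that fails. As far as I can tell from the utf8 docs, utf8::valid only checks that a string is internally consistent, so it does not distinguish the two encodings, which is why the sketch does not use it. The `-1` split limit and the `join` are my own choices, made so that trailing newlines pass through unchanged.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sketch only: input is assumed to be raw bytes, with each line encoded
# in either UTF-8 or ISO-8859-1. Returns UTF-8 encoded bytes.
sub force_utf8 {
    my ($input) = @_;
    my @fixed;
    # Limit -1 keeps trailing empty fields, so a final "\n" is preserved.
    foreach my $line (split /\n/, $input, -1) {
        # Strict decode: FB_CROAK dies on malformed UTF-8, and the eval
        # turns that into undef. LEAVE_SRC keeps $line unmodified.
        my $chars = eval {
            decode('UTF-8', $line, Encode::FB_CROAK | Encode::LEAVE_SRC);
        };
        # Fallback: every byte maps to an ISO-8859-1 character, so this
        # decode cannot fail, which gives the "best effort" behaviour.
        $chars = decode('ISO-8859-1', $line) unless defined $chars;
        push @fixed, encode('UTF-8', $chars);
    }
    return join "\n", @fixed;
}
```

Called as `$junk = force_utf8($junk);`. One known limitation: an ISO-8859-1 line whose bytes happen to form valid UTF-8 (for example the byte pair 0xC3 0xA9) will be treated as UTF-8, and lines in any third encoding are treated as ISO-8859-1.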
© Stack Overflow or respective owner