Should I convert overly-long UTF-8 strings to their shortest normal form?

Posted by Grant McLean on Stack Overflow
Published on 2010-04-30T10:54:08Z Indexed on 2010/04/30 22:27 UTC

Filed under: perl | utf-8 | encoding | security

I've just been reworking my Encoding::FixLatin Perl module to handle overly-long UTF-8 byte sequences and convert them to the shortest normal form.

My question is, quite simply: is this a bad idea?

A number of sources (including this RFC) suggest that any over-long UTF-8 should be treated as an error and rejected. They caution against "naive implementations" and leave me with the impression that these things are inherently unsafe.
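To make the "naive implementation" concern concrete, here is a minimal sketch of the kind of problem those warnings usually describe (RFC 3629, for one, discusses a similar "/../" case in its security considerations). Both `looks_safe` and `shorten_overlong_ascii` are hypothetical helpers written for this illustration, not part of Encoding::FixLatin, and the path is made up:

```perl
use strict;
use warnings;

# Hypothetical validator that inspects the raw bytes before any decoding.
# It only knows about the literal two-byte string "..".
sub looks_safe {
    my ($bytes) = @_;
    return index($bytes, '..') < 0 ? 1 : 0;
}

# Hypothetical lenient step: rewrite over-long 2-byte forms of ASCII
# characters (lead byte C0 or C1) to the single byte they stand for.
sub shorten_overlong_ascii {
    my ($bytes) = @_;
    $bytes =~ s/([\xC0\xC1])([\x80-\xBF])/chr(((ord($1) & 0x1F) << 6) | (ord($2) & 0x3F))/ge;
    return $bytes;
}

# "." is 0x2E; C0 AE is an over-long encoding of the same code point.
my $input = "/files/\xC0\xAE\xC0\xAE/secret";

print "check on raw input passes\n" if looks_safe($input);

my $cleaned = shorten_overlong_ascii($input);
print "after shortening: $cleaned\n";
print "check on cleaned input fails\n" unless looks_safe($cleaned);
```

The risk in this sketch comes from the ordering: a check that runs on the raw bytes is bypassed if something downstream later shortens the sequence. If the shortening happens first and every check runs on the cleaned output, the check sees the real characters.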

Since the whole purpose of my module is to clean up messy data files with mixed encodings and convert them to clean UTF-8, this seems like just one more thing I can clean up so the application layer doesn't have to deal with it. My code does not concern itself with any semantic meaning the resulting characters might have; it simply converts them into a normalised form.
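For readers unfamiliar with the conversion being described, the following is a rough sketch of what "shortest normal form" means at the byte level. It is not the module's actual code; `decode_lenient`, `encode_shortest` and `normalise_overlong` are names invented for this illustration, and each call is assumed to receive exactly one multi-byte sequence:

```perl
use strict;
use warnings;

# Hypothetical helper: extract the code point from one 2-, 3- or 4-byte
# sequence by its payload bits, without requiring the shortest form.
sub decode_lenient {
    my ($bytes) = @_;
    my ($lead, @cont) = unpack 'C*', $bytes;
    my ($cp, $expected);
    if    (($lead & 0xE0) == 0xC0) { ($cp, $expected) = ($lead & 0x1F, 1) }
    elsif (($lead & 0xF0) == 0xE0) { ($cp, $expected) = ($lead & 0x0F, 2) }
    elsif (($lead & 0xF8) == 0xF0) { ($cp, $expected) = ($lead & 0x07, 3) }
    else                           { die "not a multi-byte lead byte\n" }
    die "wrong number of continuation bytes\n" unless @cont == $expected;
    for my $c (@cont) {
        die "bad continuation byte\n" unless ($c & 0xC0) == 0x80;
        $cp = ($cp << 6) | ($c & 0x3F);
    }
    return $cp;
}

# Hypothetical helper: encode a code point using the fewest bytes possible.
sub encode_shortest {
    my ($cp) = @_;
    die "surrogate or out-of-range code point\n"
        if $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF);
    return pack 'C', $cp if $cp < 0x80;
    return pack 'C2', 0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F) if $cp < 0x800;
    return pack 'C3', 0xE0 | ($cp >> 12), 0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F) if $cp < 0x10000;
    return pack 'C4', 0xF0 | ($cp >> 18), 0x80 | (($cp >> 12) & 0x3F),
                      0x80 | (($cp >> 6) & 0x3F), 0x80 | ($cp & 0x3F);
}

sub normalise_overlong {
    my ($bytes) = @_;
    return encode_shortest(decode_lenient($bytes));
}

# E0 82 AC is an over-long 3-byte form of U+00AC; the shortest form is C2 AC.
printf "%vX\n", normalise_overlong("\xE0\x82\xAC");
```

A sequence that is already in its shortest form, such as C3 A9 for U+00E9, passes through unchanged, so the transformation is idempotent.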

Am I missing something? Is there a hidden danger I haven't considered?

© Stack Overflow or respective owner
