Guessing UTF-8 encoding

Posted by Dervin Thunk on Stack Overflow See other posts from Stack Overflow or by Dervin Thunk
Published on 2009-09-11T00:03:59Z Indexed on 2010/04/03 8:13 UTC
Read the original article Hit count: 487

Filed under:

utf-8

|

encoding

I have a question that may be quite naive, but I feel the need to ask, because I don't really know what is going on. I'm on Ubuntu.

Suppose I do

echo "t" > test.txt

if I then

file test.txt

I get test.txt:ASCII text

If I then do

echo "å" > test.txt

Then I get

test.txt: UTF-8 Unicode text

How does that happen? How does file "know" the encoding, or, alternatively, how does it guess it?

Thanks.

© Stack Overflow or respective owner

Related posts about utf-8

Why can't I change the AU_AU locale to en_US?

as seen on Ask Ubuntu - Search for 'Ask Ubuntu'
/bin/bash: warning: setlocale: LC_ALL: cannot change locale ( (unset)) Generating locales... en_US.ISO-8859-1... /usr/sbin/locale-gen: line 177: warning: setlocale: LC_ALL: cannot change locale ( (unset)) done Generation complete. ganesha@ubuntu:~$ sudo update_locale LANG=en_US sudo: update_locale:… >>> More
Confused about C++'s std::wstring, UTF-16, UTF-8 and displaying strings in a windows GUI

as seen on Stack Overflow - Search for 'Stack Overflow'
I'm working on a english only C++ program for Windows where we were told "always use std::wstring", but it seems like nobody on the team really has much of an understanding beyond that. I already read the question titled "std::wstring VS std::string. It was very helpful, but I still don't quite… >>> More
Reading a plist utf-8 value as utf-16

as seen on Stack Overflow - Search for 'Stack Overflow'
I'm working on an iphone app that needs to display superscripts and subscripts. I'm using a picker to read in data from a plist but the unicode values aren't being displayed corretly in the pickerview. Subscripts and superscripts are not being recognized. I'm assuming this is due to the encoding… >>> More
Forcing a mixed ISO-8859-1 and UTF-8 multi-line string into UTF-8

as seen on Stack Overflow - Search for 'Stack Overflow'
Consider the following problem: A multi-line string $junk contains some lines which are encoded in UTF-8 and some in ISO-8859-1. I don't know a priori which lines are in which encoding, so heuristics will be needed. I want to turn $junk into pure UTF-8 with proper re-encoding of the ISO-8859-1… >>> More
How can I tell if a CSV is in UTF-7 or UTF-8

as seen on Stack Overflow - Search for 'Stack Overflow'
Excel seems to save CSV files in (what I think is) UTF-7, despite the fact that most information I have read suggest that in general, you should not UTF-7. Indeed, other applications (Text pad, which lets me choose) save things in UTF-8 (or Unicode etc, but UTF-7 is not even an option). Using .NET… >>> More

Related posts about encoding

<?xml version=“1.0” encoding=“UTF-8”?> not <?xml version='1.0' encoding='UTF-8'?>

as seen on Stack Overflow - Search for 'Stack Overflow'
I am using lxml with tree.write(xmlFileOut, pretty_print = True, xml_declaration = True, encoding='UTF-8' to write out my opened and edited xml file, but I absolutely need to have the xml declaration as <?xml version=“1.0” encoding=“UTF-8”?> and NOT <?xml version='1.0' encoding='UTF-8'… >>> More
Ivar definitions show 'long' type encoding as 'long long' type encoding

as seen on Stack Overflow - Search for 'Stack Overflow'
I've found what I think may be a bug with Ivar and Objective-C runtime. I'm using XCode 3.2.1 and associated libraries, developing a 64 bit app on X86_64 (MacBook Pro). Where I would expect the type encoding for the following "longVal" to be 'l', the Ivar encoding is showing a 'q' (which is a 'long… >>> More
How to avoid encoding the key of request parameters being encoding

as seen on Stack Overflow - Search for 'Stack Overflow'
I'm trying to send a http request using WS.url() with a action receive a custom class parameter like public static void add(@Valid MyPage info) {...} There is a Map in MyPage @Required public Map<String, String> content = new HashMap<String, String>(); But When I try to send a request… >>> More
C# Check if character exists in encoding

as seen on Stack Overflow - Search for 'Stack Overflow'
I am writing a program that a part renders a bitmap font in CP437. In a function that renders the text with I want to be able to check whether a char is available in CP437 before the encoding conversion, like: public static void DrawCharacter(this Graphics g, char c) { if (char_exist_in_encoding(Encoding… >>> More
How to detect the character encoding of a text file?

as seen on Stack Overflow - Search for 'Stack Overflow'
I try to detect which character encoding is used in my file. I try with this code to get the standard encoding public static Encoding GetFileEncoding(string srcFile) { // *** Use Default of Encoding.Default (Ansi CodePage) Encoding enc = Encoding.Default; // *** Detect byte… >>> More