utf 8 - Page 20 - Developer IT

Four byte encoding of U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS)?

- by knorv

Which character encoding represents the character ö (U+00F6, LATIN SMALL LETTER O WITH DIAERESIS, or simply chr(246) in ISO-8859-1) as the four-octet combination chr(195) . chr(63) . chr(194) . chr(164)?

Non-Latin characters in URLs - is it better to encode them or replace with their Latin "counterparts

- by Pawel Krakowiak

We're implementing a blog for a site which supports six different languages, and five of them have non-Latin characters in their alphabets. We are not sure whether we should have them encoded (that is what we're doing at the moment):

Létání s potravinami: Co je dovoleno? becomes l%c3%a9t%c3%a1n%c3%ad-s-potravinami-co-je-dovoleno and the browser displays it as létání-s-potravinami-co-je-dovoleno

or whether we should replace them with their Latin "counterparts" (similar-looking letters):

Létání s potravinami: Co je dovoleno? becomes letani-s-potravinami-co-je-dovoleno

I can't find a definitive answer as to which is better from an SEO perspective. Search engine optimization is very important for us. Which approach would you suggest?
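
To make the two options concrete, here is a minimal Python sketch that produces both slug forms from the example title. The helper name `slug_words` is made up for this illustration, and the accent stripping relies on Unicode decomposition, which is an assumption about how the "Latin counterparts" would be derived; it does not cover letters without a decomposition, such as ł.

```python
import re
import unicodedata
from urllib.parse import quote


def slug_words(title, strip_accents):
    """Lower-case the title and join its words with hyphens (hypothetical helper)."""
    text = title.lower()
    if strip_accents:
        # Decompose accented letters, then drop the combining marks.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # In Python 3, \w matches Unicode letters, so "létání" stays one word.
    return "-".join(re.findall(r"\w+", text))


title = "Létání s potravinami: Co je dovoleno?"

# Option 1: keep the original letters and percent-encode their UTF-8 bytes.
encoded_slug = quote(slug_words(title, strip_accents=False))

# Option 2: replace accented letters with their unaccented base letters.
latin_slug = slug_words(title, strip_accents=True)

print(encoded_slug)
print(latin_slug)
```

Note that `quote` emits upper-case hex digits (`%C3%A9`), which is equivalent to the lower-case form shown in the question.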

Best way to convert between [Char] and [Word8]?

- by cmars232

I'm new to Haskell and I'm trying to use a pure SHA1 implementation in my app (Data.Digest.Pure.SHA) with a JSON library (AttoJSON). AttoJSON uses Data.ByteString.Char8 bytestrings, SHA uses Data.ByteString.Lazy bytestrings, and some of the string literals in my app are [Char]. This article seems to indicate that this is something still being worked out in the Haskell language/Prelude: http // hackage.haskell.org/trac/haskell-prime/wiki/CharAsUnicode And this one lists a few libraries, but it's a couple of years old: http //blog.kfish.org/2007/10/survey-haskell-unicode-support.html [Links broken because SO doesn't trust me -- whatever...] What is the current best way to convert between these types, and what are some of the tradeoffs? I don't want to pick something that is obsolete... Thanks!

Differences between utf8 and latin1

- by binbash

What is the difference between utf8 and latin1?
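
As a small illustration of the question, the Python sketch below encodes the same characters with both encodings. The sample characters are chosen for this example only.

```python
# Latin-1 (ISO-8859-1) maps each of its 256 characters to exactly one byte.
# UTF-8 uses one byte for ASCII and two to four bytes for everything else.
text = "é"
latin1_bytes = text.encode("latin-1")
utf8_bytes = text.encode("utf-8")

# ASCII characters are encoded identically in both.
ascii_same = "a".encode("latin-1") == "a".encode("utf-8")

# Characters outside Latin-1 (here the euro sign) cannot be encoded in it.
try:
    "€".encode("latin-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

print(latin1_bytes, utf8_bytes, ascii_same, euro_fits_latin1)
```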

Form submission info showing up in URL and not working

- by kcurtin

I am making a Rails 3.1 app and have a signup form that was working fine, but I seem to have changed something that broke it. I'm using Twitter Bootstrap and the twitter_bootstrap_form_for gem. I made some change that messed with the formatting of the form fields, but more importantly, when I submit the Sign Up form to create a new User, the information shows up in the URL and looks like this:

    http://localhost:3000/?utf8=%E2%9C%93&authenticity_token=UaKG5Y8fuPul2Klx7e2LtdPLTRepBxDM3Zdy8S%2F52W4%3D&user%5Bemail%5D=kevinc%40example.com&user%5Bpassword%5D=testing&user%5Bpassword_confirmation%5D=testing&commit=Sign+Up

EDIT: This is happening in the latest versions of Chrome and Firefox.

Here is the code for the form:

    <div class="span7">
      <h3 class="center" id="more">Sign Up Now!</h3>
      <%= twitter_bootstrap_form_for @user do |user| %>
        <%= user.email_field :email, :placeholder => '[email protected]' %>
        <%= user.password_field :password %>
        <%= user.password_field :password_confirmation, 'Confirm Password' %>
        <%= user.actions do %>
          <%= user.submit 'Sign Up' %>
        <% end %>
      <% end %>
    </div>

Here is the code for the UsersController:

    class UsersController < ApplicationController
      def new
        @user = User.new
      end

      def create
        @user = User.new(params[:user])
        if @user.save
          redirect_to about_path, :notice => "Signed up!"
        else
          render 'new'
        end
      end
    end

Not sure if there is more you need, but if so let me know! Thank you!

Edit: For debugging I tried specifying :post and also using a plain form_for:

    <%= form_for(@user, :method => :post) do |f| %>
      <div class="field">
        <%= f.label :email %>
        <%= f.email_field :email %>
      </div>
      <div class="field">
        <%= f.label :password %>
        <%= f.password_field :password %>
      </div>
      <div class="field">
        <%= f.label :password_confirmation %>
        <%= f.password_field :password_confirmation %>
      </div>
      <div class="actions"><%= f.submit "Sign Up" %></div>
    <% end %>

This gives me the same problem as above.

Adding routes.rb:

    Auth31::Application.routes.draw do
      get "home" => "pages#home"
      get "about" => "pages#about"
      get "contact" => "pages#contact"
      get "help" => "pages#help"
      get "login" => "sessions#new", :as => "login"
      get "logout" => "sessions#destroy", :as => "logout"
      get "signup" => "users#new", :as => "signup"
      root :to => "pages#home"
      resources :pages
      resources :users
      resources :sessions
      resources :password_resets
    end
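
For readers puzzled by the query string itself, the following Python sketch decodes a shortened copy of it (the shortening is mine; only three of the original parameters are kept). It shows that `utf8=%E2%9C%93` is just a percent-encoded check mark and that the form fields arrived as URL parameters, which is what a form submitted with GET looks like.

```python
from urllib.parse import parse_qs, urlsplit

# Shortened copy of the URL from the question (assumption: the omitted
# parameters decode the same way).
url = (
    "http://localhost:3000/?utf8=%E2%9C%93"
    "&user%5Bemail%5D=kevinc%40example.com&commit=Sign+Up"
)

# parse_qs percent-decodes names and values and treats "+" as a space.
params = parse_qs(urlsplit(url).query)

for name, values in params.items():
    print(name, "=", values[0])
```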

Jquery, how to escape quotes

- by Sandro Antonucci

I'm using some simple jQuery code that grabs HTML code from a tag and then puts this content into a form input:

    <td class="name_cat" ><span class="name_cat">It's a "test" </span> (5)</td>

jQuery gets the content of span.name_cat and returns it as It's a "test". So when I print this into an input it becomes

    <input value="It's a "test"" />

which, as you can imagine, will only show as It's a , because the following double quote closes the value attribute. What's the trick here to keep the original string while not showing utf8 code in the input?

jQuery code:

    $(".edit_cat").click(function(){
        tr = $(this).parents("tr:first");
        id_cat = $(this).attr("id");
        td_name = tr.find(".name_cat");
        span_name = tr.find("span.name_cat").html();
        form = '<form action="/admin/controllers/edit_cat.php" method="post" >'+
            '<input type="hidden" name="id_cat" value="'+id_cat+'" />'+
            '<input type="text" name="name_cat" value="'+span_name+'" />'+
            '<input type="submit" value="save" />'+
            '</form>';
        td_name.html(form);
        console.log(span_name);
    });

I basically need html() not to decode Utf8
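
The underlying issue is attribute escaping rather than anything specific to jQuery, so here is a language-neutral illustration in Python (not the asker's stack). It is only a sketch of the principle: a value placed inside a quoted attribute needs its quotes turned into entities, and the browser turns them back when displaying the field.

```python
import html

span_text = 'It\'s a "test"'

# Escape &, <, > and both kinds of quotes before building the attribute.
attribute_value = html.escape(span_text, quote=True)
input_tag = '<input type="text" name="name_cat" value="%s" />' % attribute_value

# A browser (or html.unescape) recovers the original text from the entities.
round_tripped = html.unescape(attribute_value)

print(input_tag)
```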

Unicode generated by toEscapedUnicode method is without spaces

- by vishvesha

For this word ????????????? the escaped Unicode is \u0938\u0941\u0916\u091A\u0948\u0928\u093E\u0928\u0940 \u0930\u0940\u091D\u0941\u092E\u0932 \u091C\u093F\u0935\u0924\u0930\u093E\u092E, and note that it has spaces before \u0930 and \u091C. But when I try this in my code:

    String tempString=Strings.toEscapedUnicode(strString);

the method gives a result without spaces: \u0938\u0941\u0916\u091A\u0948\u0928\u093E\u0928\u0940\u0930\u0940\u091D\u0941\u092E\u0932\u091C\u093F\u0935\u0924\u0930\u093E\u092E and that's why the two do not match. My toEscapedUnicode method generates the escaped Unicode without spaces. I want the spaces, so how do I do it?
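
The desired output can be sketched in Python as follows. This is not the `Strings.toEscapedUnicode` method from the question (its library is not shown); `to_escaped_unicode` and its `keep` parameter are hypothetical names for an escaper that leaves chosen characters, here the space, untouched.

```python
def to_escaped_unicode(text, keep=" "):
    """Escape every character as \\uXXXX except those listed in `keep`."""
    out = []
    for ch in text:
        if ch in keep:
            out.append(ch)
            continue
        # Encode to UTF-16 so characters outside the BMP become two
        # \uXXXX surrogate escapes, as they would in a Java string.
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append("\\u%04X" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)


sample = "\u0930\u0940 \u091C\u093F"
escaped = to_escaped_unicode(sample)
print(escaped)
```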

mine phrases (up to 3 words) from a given text

- by DS_web_developer

I asked before for a simple solution to my problem (using the Sphinx search service) but I got nowhere. Someone has kindly provided me with this code:

    <?php
    /**
     * $Project: GeoGraph $
     * $Id$
     *
     * GeoGraph geographic photo archive project
     * This file copyright (C) 2005 Barry Hunter ([email protected])
     *
     * This program is free software; you can redistribute it and/or
     * modify it under the terms of the GNU General Public License
     * as published by the Free Software Foundation; either version 2
     * of the License, or (at your option) any later version.
     *
     * This program is distributed in the hope that it will be useful,
     * but WITHOUT ANY WARRANTY; without even the implied warranty of
     * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
     * GNU General Public License for more details.
     *
     * You should have received a copy of the GNU General Public License
     * along with this program; if not, write to the Free Software
     * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
     */

    /**
     * Provides the methods for updating the worknet tables
     *
     * @package Geograph
     * @author Barry Hunter <[email protected]>
     * @version $Revision$
     */

    function addTwoLetterPhrase($phrase) {
        global $w2;
        $w2[$phrase] = (isset($w2[$phrase]))?($w2[$phrase]+1):1;
    }

    function addThreeLetterPhrase($phrase) {
        global $w3;
        $w3[$phrase] = (isset($w3[$phrase]))?($w3[$phrase]+1):1;
    }

    function updateWordnet(&$db,$text,$field,$id) {
        global $w1,$w2,$w3;

        $alltext = strtolower(preg_replace('/\W+/',' ',str_replace("'",'',$text)));

        if (strlen($text)< 1)
            return;

        $words = preg_split('/ /',$alltext);

        $w1 = array();
        $w2 = array();
        $w3 = array();

        //build a list of one word phrases
        foreach ($words as $word) {
            $w1[$word] = (isset($w1[$word]))?($w1[$word]+1):1;
        }

        //build a list of two word phrases
        $text = $alltext;
        $text = preg_replace('/(\w+) (\w+)/e','addTwoLetterPhrase("$1 $2")',$text);
        $text = $alltext;
        $text = preg_replace('/(\w+)/','',$text,1);
        $text = preg_replace('/(\w+) (\w+)/e','addTwoLetterPhrase("$1 $2")',$text);

        //build a list of three word phrases
        $text = $alltext;
        $text = preg_replace('/(\w+) (\w+) (\w+)/e','addThreeLetterPhrase("$1 $2 $3")',$text);
        $text = $alltext;
        $text = preg_replace('/(\w+)/','',$text,1);
        $text = preg_replace('/(\w+) (\w+) (\w+)/e','addThreeLetterPhrase("$1 $2 $3")',$text);
        $text = $alltext;
        $text = preg_replace('/(\w+) (\w+)/','',$text,1);
        $text = preg_replace('/(\w+) (\w+) (\w+)/e','addThreeLetterPhrase("$1 $2 $3")',$text);

        foreach ($w1 as $word=>$count) {
            $db->Execute("insert into wordnet1 set gid = $id,words = '$word',$field = $count");// ON DUPLICATE KEY UPDATE $field=$field+$count");
        }
        foreach ($w2 as $word=>$count) {
            $db->Execute("insert into wordnet2 set gid = $id,words = '$word',$field = $count");
        }
        foreach ($w3 as $word=>$count) {
            $db->Execute("insert into wordnet3 set gid = $id,words = '$word',$field = $count");
        }
    }
    ?>

It works fine and does almost exactly what I need, except that it is not UTF-8 friendly: it splits whole words into parts (on special characters) where it shouldn't. So my guess is that I should use multibyte functions instead of the regular preg_replace. I tried to replace preg_replace with mb_ereg_replace but it is not working as it should, at least not for two- and three-word phrases. Any ideas?
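
As a point of comparison, here is a minimal Python sketch of Unicode-aware phrase counting. It is not a port of the PHP above: it uses a sliding window over the word list instead of repeated regex replacement, and the function name `count_phrases` is made up for the example. It relies on the fact that `\w` in Python 3 string patterns matches letters such as é or ž, so accented words are not split.

```python
import re
from collections import Counter


def count_phrases(text, max_words=3):
    """Count every phrase of 1..max_words consecutive words in `text`."""
    # Drop apostrophes, lower-case, then split on anything that is not a
    # Unicode word character.
    words = re.findall(r"\w+", text.replace("'", "").lower())
    counts = {n: Counter() for n in range(1, max_words + 1)}
    for n in counts:
        for start in range(len(words) - n + 1):
            counts[n][" ".join(words[start:start + n])] += 1
    return counts


phrases = count_phrases("Létání s potravinami: létání s dětmi")
print(phrases[2].most_common(3))
```

Each counter could then be written to its own table, in the same way the PHP code fills wordnet1, wordnet2 and wordnet3.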

Jython 'unknown encoding ms932' on Japanese system

- by Houtman

I've written a program in Jython 2.5.1 which works fine on my Windows 7 machine, but on a Japanese machine it throws an exception saying "unknown encoding 'ms932'". I found that codecs.java is the only module printing the unknown encoding 'xyz' message. This file loads aliases.py, which does contain:

    # cp932 codec
    '932' : 'cp932',
    'ms932' : 'cp932',
    'mskanji' : 'cp932',
    'ms_kanji' : 'cp932',

The file cp932.py contains:

    import _codecs_jp, codecs

But _codecs_jp does not exist, as is also discussed on this page: http://web.archiveorange.com/archive/v/8tc1Zc2rV3qiUcy9zPlA

Does anyone have a clue where to go from here?

Contents of a node in Nokogiri

- by Styggentorsken

Is there a way to select all the contents of a node in Nokogiri?

    <root>
      <element>this is <hi>the content</hi> of my æøå element</element>
    </root>

The result of getting the content of /root/element should be

    this is <hi>the content</hi> of my æøå element

Edit: It seems like the solution is simply to use myElement.inner_html(). The problem I had was in fact that I was relying on an old version of libxml2, which escaped all the special characters.
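
For comparison outside Ruby, the same "inner markup" result can be sketched with Python's standard ElementTree. The helper `inner_xml` is a name invented for this example; it is only an analogue of Nokogiri's inner_html, not a description of how Nokogiri works.

```python
import xml.etree.ElementTree as ET


def inner_xml(element):
    """Return the text and child markup inside `element`, without its own tag."""
    parts = [element.text or ""]
    for child in element:
        # tostring() serialises the child and the text that follows it
        # (its tail); encoding="unicode" returns str and keeps æøå as-is.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


doc = ET.fromstring(
    "<root> <element>this is <hi>the content</hi> of my æøå element</element> </root>"
)
content = inner_xml(doc.find("element"))
print(content)
```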

Using php to create a password system with chinese characters

- by WillDonohoe

Hi guys, I'm having an issue with validating Chinese characters against other Chinese characters. For example, I'm creating a simple password script which gets data from a database and gets the user input through GET. The issue I'm having is that, for some reason, even though the characters look exactly the same when you echo them out, my if statement still thinks they are different. I have tried using the htmlentities() function to encode the characters. The password from the database encodes nicely, giving me a working '& #35441;' (I've put a space in it to stop it from converting to a Chinese character!). The user input value gives me a load of funny characters. The only explanation I can think of is that the input is encoded in a different way, and therefore PHP thinks they are two completely different strings. Does anybody have any ideas? Thanks in advance, Will
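
The suspicion in the last sentences can be illustrated with a short Python sketch. It assumes, purely for the example, that one side holds UTF-8 bytes and the other Big5 bytes; the actual encodings in the asker's setup are not known. The point is that the same character yields different bytes per encoding, so a comparison is only meaningful after both sides are decoded to text.

```python
import unicodedata

char = chr(35441)  # the character behind the entity quoted in the question

stored = char.encode("utf-8")    # assumed encoding of the database value
submitted = char.encode("big5")  # assumed encoding of the form input


def same_text(a, a_encoding, b, b_encoding):
    """Compare two byte strings as text, after decoding and normalising."""
    left = unicodedata.normalize("NFC", a.decode(a_encoding))
    right = unicodedata.normalize("NFC", b.decode(b_encoding))
    return left == right


bytes_equal = stored == submitted
text_equal = same_text(stored, "utf-8", submitted, "big5")
print(bytes_equal, text_equal)
```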

PHP: Convert web-page to utf8

- by Paul Tarjan

I would like to only work with UTF8. The problem is I don't know the charset of every webpage. How can I detect it and convert to UTF8?

    <?php
    $url = "http://vkontakte.ru";
    $ch = curl_init($url);
    $options = array(
        CURLOPT_RETURNTRANSFER => true,
    );
    curl_setopt_array($ch, $options);
    $data = curl_exec($ch);
    // $data = magic($data);
    print $data;

See this at: http://paulisageek.com/tmp/curl-utf8

What is magic()?
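
One possible shape for such a `magic()` step, sketched in Python rather than PHP: look for a charset in the Content-Type header, then in a `<meta>` tag, and fall back to a default. The function names, the 4096-byte scan window and the UTF-8 fallback are all assumptions made for this sketch; real pages can declare a wrong charset or none at all, so this is a heuristic, not a guaranteed detector.

```python
import codecs
import re

_HEADER_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.I)
_META_RE = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.I)


def guess_charset(body, content_type=None, default="utf-8"):
    """Return a Python codec name for the page, or `default` if none is found."""
    candidates = []
    if content_type:
        match = _HEADER_RE.search(content_type)
        if match:
            candidates.append(match.group(1))
    match = _META_RE.search(body[:4096])
    if match:
        candidates.append(match.group(1).decode("ascii", "replace"))
    for name in candidates:
        try:
            return codecs.lookup(name).name
        except LookupError:
            continue  # declared charset is unknown to Python; try the next
    return default


def to_utf8(body, content_type=None):
    """Decode with the guessed charset and re-encode as UTF-8."""
    charset = guess_charset(body, content_type)
    return body.decode(charset, errors="replace").encode("utf-8")


page = '<html><head><meta charset="windows-1251"></head><body>Привет</body></html>'
raw = page.encode("cp1251")
print(to_utf8(raw).decode("utf-8"))
```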

NSUTF8StringEncoding gives me this %0A%20%20%20%20%22http://example.com/example.jpg%22%0A

- by user1530141

So I'm trying to load pictures from Twitter. If I just use the URL from the JSON results without encoding it, dataWithContentsOfURL fails with a nil URL argument. If I encode it, I get %0A%20%20%20%20%22http://example.com/example.jpg%22%0A. I know I can use rangeOfString: or stringByReplacingOccurrencesOfString:, but can I be sure that it will always be the same? Is there another way to handle this, and why is this happening to my Twitter response and not my Instagram response? I have also tried stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet] and it does nothing.

This is the URL directly from the JSON:

    2013-11-08 22:09:31:812 JaVu[1839:1547] -[SingleEventTableViewController tableView:cellForRowAtIndexPath:] [Line 406] (
        "http://pbs.twimg.com/media/BYWHiq1IYAAwSCR.jpg"
    )

Here is my code:

    if ([post valueForKeyPath:@"entities.media.media_url"]) {
        NSString *twitterString = [[NSString stringWithFormat:@"%@", [post valueForKeyPath:@"entities.media.media_url"]] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
        twitterString = [twitterString stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
        NSLog(@"%@", twitterString);
        if (twitterString != nil){
            NSURL *twitterPhotoUrl = [NSURL URLWithString:twitterString];
            NSLog(@"%@", twitterPhotoUrl);
            dispatch_queue_t queue = kBgQueue;
            dispatch_async(queue, ^{
                NSError *error;
                NSData* data = [NSData dataWithContentsOfURL:twitterPhotoUrl options:NSDataReadingUncached error:&error];
                UIImage *image = [UIImage imageWithData:data];
                dispatch_sync(dispatch_get_main_queue(), ^{
                    [streamPhotoArray replaceObjectAtIndex:indexPath.row withObject:image];
                    cell.instagramPhoto.image = image;
                });
            });
        }
    }
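
Decoding the percent-escaped string shows what was actually in it. The Python sketch below does only that; the interpretation in the comments (that the string looks like the printed form of a one-element list rather than a bare URL) is my reading of the decoded text and of the log output, not something the question confirms.

```python
from urllib.parse import unquote

escaped = "%0A%20%20%20%20%22http://example.com/example.jpg%22%0A"
decoded = unquote(escaped)

# %0A is a newline, %20 a space, %22 a double quote: the text is a URL
# wrapped in quotes and indentation, which matches the body of the
# parenthesised list shown in the log output.
print(repr(decoded))

# Stripping whitespace and the surrounding quotes recovers the bare URL.
bare_url = decoded.strip().strip('"')
print(bare_url)
```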

Problem with single quotes in man pages

- by Peter

When I ssh into my Debian Lenny server and open a man page, single quotes appear to be messed up. Example from the man page of apt-get:

    If no package matches the given expression and the expression contains one of Â´.Â´, Â´?Â´ or Â´*Â´ then it is assumed to be a POSIX regular expression, and it is applied to all package names in the database. Any matches are then installed (or removed). Note that matching is done by substring so Â´lo.*Â´ matches Â´how-loÂ´ and Â´lowestÂ´. If this is undesired, anchor the regular expression with a Â´^Â´ or Â´$Â´ character, or create a more specific regular expression.

I'm on Mac OS X and using xterm. If I use Terminal, the problem doesn't happen. My locale is configured correctly as far as I can see:

    $ locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=

I'm not sure what's wrong with my environment, and I have no idea what to check next. I'd appreciate help.
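
The garbled quotes have a recognisable signature, which the following Python sketch reproduces: the acute accent (U+00B4) encoded as UTF-8 and then displayed as if each byte were a Latin-1 character. This only demonstrates which misreading produces the symptom; where in the chain (xterm, ssh, the pager) it happens is not established by the question.

```python
def misread_utf8_as_latin1(text):
    """Show how UTF-8 bytes look when each byte is displayed as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Reverse the misreading, assuming it happened exactly once."""
    return garbled.encode("latin-1").decode("utf-8")


garbled = misread_utf8_as_latin1("\u00b4lo.*\u00b4")
print(garbled)
print(repair(garbled))
```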

locale: What is the LANGUAGE variable used for? (and when?)

- by seya

I am trying to understand the locales used in Linux. On my Ubuntu 11.10 system, locale prints the following:

    LANG=en_DK.UTF-8
    LANGUAGE=en_GB:en
    LC_CTYPE=en_GB.UTF-8
    LC_NUMERIC="en_DK.UTF-8"
    LC_TIME="en_DK.UTF-8"
    LC_COLLATE=en_GB.UTF-8
    LC_MONETARY="en_DK.UTF-8"
    LC_MESSAGES=en_GB.UTF-8
    LC_PAPER="en_DK.UTF-8"
    LC_NAME="en_DK.UTF-8"
    LC_ADDRESS="en_DK.UTF-8"
    LC_TELEPHONE="en_DK.UTF-8"
    LC_MEASUREMENT="en_DK.UTF-8"
    LC_IDENTIFICATION="en_DK.UTF-8"
    LC_ALL=

(en_DK is for using the international date format, continental European number formatting (1.234,56), etc.)

I think I understand what the LC_* family does, that LANG is the fallback if one of them is not set, and that LC_ALL sets all of the LC_* variables to its value. What I don't know yet is what LANGUAGE is used for. The notation en_GB:en reminds me of the Accept-Language HTTP header. With the settings above it would mean that British English is used if a translation for it exists. Otherwise any existing English translation (en_US, en_AU, ..., whatever) would be used. Am I right so far?

Also, which programs actually obey the LANGUAGE setting? How is it different from LC_MESSAGES? Unfortunately, man locale only documents the LC_* family, and searching the web for 'linux locale LANGUAGE' or similar is a moot point. (Of course, language is a word often used when talking about locales, and it may also just be shown in the output of locale without being discussed.) Can anybody help me out here?
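
One concrete consumer of LANGUAGE is gettext. The sketch below mirrors, in simplified form, the lookup order that Python's `gettext.find` uses when choosing message catalogues (LANGUAGE, then LC_ALL, LC_MESSAGES, LANG, with the first non-empty value split on colons). The function name `message_languages` is invented here, and the sketch leaves out details such as GNU gettext ignoring LANGUAGE when the locale is C; treat it as an illustration, not a full specification.

```python
def message_languages(env):
    """Return the ordered language preferences a gettext-style lookup would try."""
    for name in ("LANGUAGE", "LC_ALL", "LC_MESSAGES", "LANG"):
        value = env.get(name)
        if value:
            # Only LANGUAGE is normally a colon-separated priority list.
            return value.split(":")
    return []


# Values taken from the locale output in the question.
env = {
    "LANG": "en_DK.UTF-8",
    "LANGUAGE": "en_GB:en",
    "LC_MESSAGES": "en_GB.UTF-8",
    "LC_ALL": "",
}
with_language = message_languages(env)

# Without LANGUAGE, the lookup falls through to LC_MESSAGES.
without_language = message_languages(
    {key: value for key, value in env.items() if key != "LANGUAGE"}
)
print(with_language, without_language)
```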

How does the web server choose between unicode and utf-8 for accented characters?

- by jacques

I have a web server with my ISP which replaces accented characters in URLs with their unicode values: for instance é (eacute) is translated to %e9 (dec 233). For testing locally I use EasyPHP which translates those characters by their utf-8 equivalence: é is then replaced by the well known sequence %c3%a9 (Ã©)... Browsers served by EasyPHP don't decode unicode values but they do if running locally (utf-8 and non converted accent also)... I have been unable to find where this behavior is configured in the server. This is a problem as some urls are built by my application using the php rawurlencode() which seems to always encode with unicode values on both servers. Any idea?
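
The two escape sequences in the question correspond to the same character percent-encoded from two different byte encodings, which the Python sketch below reproduces. Strictly speaking, %e9 is the ISO-8859-1 byte for é (which happens to equal its code point), while %c3%a9 is its UTF-8 form. PHP's rawurlencode() escapes whatever bytes the string already contains, so it is my assumption, not something the question states, that the script's own encoding determines which form it emits.

```python
from urllib.parse import quote, unquote

char = "é"

latin1_form = quote(char, encoding="latin-1")
utf8_form = quote(char, encoding="utf-8")
print(latin1_form, utf8_form)

# Decoding only round-trips when the same encoding is assumed on both ends.
# Reading the Latin-1 form as UTF-8 yields the replacement character.
misdecoded = unquote(latin1_form, encoding="utf-8", errors="replace")
print(repr(misdecoded))
```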

How to make Horde connect to mysql with UTF-8 character set?

- by jkj

How do I tell Horde 3.3.11 to use UTF-8 for its MySQL connection? The $conf['sql']['charset'] setting only tells Horde what is expected from the database. Horde uses MDB2 to connect to MySQL. Is there a way to force MDB2 or MySQL's character_set_client from php.ini?

So far I found two workarounds:

Force mysql to ignore the character set requested by the client:

    [mysqld]
    skip-character-set-client-handshake=1
    default-character-set=utf8

Force mysql to run SET NAMES utf8 on every connection:

    [mysqld]
    init-connect='SET NAMES utf8'

Both have drawbacks on a multi-user MySQL server. The first disables character set conversion altogether and the second forces every connection to produce UTF-8.

[EDIT] Found the problem. The 'charset' parameter was unset at the last minute before being sent to the SQL backend. This is probably because MySQL does not accept utf-8, only utf8. A MySQL-specific mapping is required to make it work. I just worked around it by translating utf-8 to utf8. With this patch it won't work with any other database, though.

    --- lib/Horde/Share/sql.php.orig 2011-07-04 17:09:33.349334890 +0300
    +++ lib/Horde/Share/sql.php 2011-07-04 17:11:06.238636462 +0300
    @@ -753,7 +753,13 @@
             /* Connect to the sql server using the supplied parameters. */
             require_once 'MDB2.php';
             $params = $this->_params;
    -        unset($params['charset']);
    +
    +        if ($params['charset'] == 'utf-8') {
    +            $params['charset'] = 'utf8';
    +        } else {
    +            unset($params['charset']);
    +        }
    +
             $this->_write_db = &MDB2::factory($params);
             if (is_a($this->_write_db, 'PEAR_Error')) {
                 Horde::fatal($this->_write_db, __FILE__, __LINE__);
    @@ -792,7 +798,13 @@
             /* Check if we need to set up the read DB connection seperately. */
             if (!empty($this->_params['splitread'])) {
                 $params = array_merge($params, $this->_params['read']);
    -            unset($params['charset']);
    +
    +            if ($params['charset'] == 'utf-8') {
    +                $params['charset'] = 'utf8';
    +            } else {
    +                unset($params['charset']);
    +            }
    +
                 $this->_db = &MDB2::singleton($params);
                 if (is_a($this->_db, 'PEAR_Error')) {
                     Horde::fatal($this->_db, __FILE__, __LINE__);

In utf-8 collation, why is 11- less than 1-?

- by ???

I found that the sort result in ASCII is:

    1-
    11-

and in UTF-8:

    11-
    1-

I find this counter-intuitive, and it's not dictionary order. Isn't the character '-' (002d) always less than [0-9] (0030-0039)? What's the general rule in UTF-8 collation? And how can I bypass it, so that - is less than [0-9] while other characters are kept unchanged for UTF-8, on Linux? (So that it affects the result of ls --sort, sort, etc.)

Strings are UTF-16…. There is an error in XML document (1, 1).

- by Shawn Cicoria

I had a situation today where an XML document had a directive indicating it was utf-8. The code in question was reading in the “string” of that XML and then attempting to deserialize it using an Xsd-generated type. What you end up with is an exception indicating that there’s an error in the XML document at (1,1), or something to that effect. The fix is to run it through a memory stream, which reads the string, but as UTF-8 bytes. If you have things that fall outside of 8-bit chars, you’ll get an exception.

    //Need to read it to bytes, to undo the fact that strings are UTF-16 all the time.
    //We want it to handle it as UTF8.
    byte[] bytes = Encoding.UTF8.GetBytes(_myXmlString);
    TargetType myInstance = null;
    using (MemoryStream memStream = new MemoryStream(bytes))
    {
        XmlSerializer tokenSerializer = new XmlSerializer(typeof(TargetType));
        myInstance = (TargetType)tokenSerializer.Deserialize(memStream);
    }

Writing is similar. Also, adding the default namespace prevents the additional xmlns additions that aren’t necessary:

    XmlWriterSettings settings = new XmlWriterSettings()
    {
        Encoding = Encoding.UTF8,
        Indent = true,
        NewLineOnAttributes = true,
    };
    XmlSerializerNamespaces xmlnsEmpty = new XmlSerializerNamespaces();
    xmlnsEmpty.Add("", "http://www.wow.thisworks.com/2010/05");
    MemoryStream memStr = new MemoryStream();
    using (XmlWriter writer = XmlTextWriter.Create(memStr, settings))
    {
        XmlSerializer tokenSerializer = new XmlSerializer(typeof(TargetType));
        tokenSerializer.Serialize(writer, theInstance, xmlnsEmpty);
    }

Is using HTML entities (for language-specific characters) in UTF-8 necessary?

- by Drachenzauberei

As in the subject line. I saw this the other day on a page, and it felt weird to me. Except for markup-delimiting characters such as pointy brackets or the ampersand, escaping, say, German umlauts shouldn't be necessary, should it? I checked the encoding server-side, in-page and by way of the HTTP headers, and it looks completely UTF-8 to me. What's your take on this, and do you reckon it could adversely affect SEO or SERP placement?

Problem with cyrillic symbols in console

- by woto

Hi everyone, sorry for my bad English. This is Ruby code:

    s = "???????"
    `touch #{s}`
    `cat #{s}`
    `cat < #{s}`

Can anybody tell me why this code fails with the following?

    sh: cannot open ???????: No such file

But this code works fine:

    s = "????????"
    `touch #{s}`
    `cat #{s}`
    `cat < #{s}`

The problem occurs only when the Russian symbol '?' is in the word, and only with the symbol '<'.

    woto@woto-work:/tmp$ locale
    LANG=ru_RU.UTF-8
    LC_CTYPE="ru_RU.UTF-8"
    LC_NUMERIC="ru_RU.UTF-8"
    LC_TIME="ru_RU.UTF-8"
    LC_COLLATE="ru_RU.UTF-8"
    LC_MONETARY="ru_RU.UTF-8"
    LC_MESSAGES="ru_RU.UTF-8"
    LC_PAPER="ru_RU.UTF-8"
    LC_NAME="ru_RU.UTF-8"
    LC_ADDRESS="ru_RU.UTF-8"
    LC_TELEPHONE="ru_RU.UTF-8"
    LC_MEASUREMENT="ru_RU.UTF-8"
    LC_IDENTIFICATION="ru_RU.UTF-8"
    LC_ALL=
    woto@woto-work:/tmp$ ruby -v
    ruby 1.8.7 (2010-01-10 patchlevel 249) [x86_64-linux]
    woto@woto-work:/tmp$ uname -a
    Linux woto-work 2.6.32-26-generic #48-Ubuntu SMP Wed Nov 24 10:14:11 UTC 2010 x86_64 GNU/Linux
    woto@woto-work:/tmp$ lsb_release -a
    No LSB modules are available.
    Distributor ID: Ubuntu
    Description: Ubuntu 10.04.1 LTS
    Release: 10.04
    Codename: lucid

Would it be possible to have a UTF-8-like encoding limited to 3 bytes per character?

- by dan04

UTF-8 requires 4 bytes to represent characters outside the BMP. That's not bad; it's no worse than UTF-16 or UTF-32. But it's not optimal (in terms of storage space). There are 13 bytes (C0-C1 and F5-FF) that are never used, and there are multi-byte sequences that are not used, such as the ones corresponding to "overlong" encodings. If these had been available to encode characters, then more of them could have been represented by 2-byte or 3-byte sequences (of course, at the expense of making the implementation more complex).

Would it be possible to represent all 1,114,112 Unicode code points by a UTF-8-like encoding with at most 3 bytes per character? If not, what is the maximum number of characters such an encoding could represent?

By "UTF-8-like", I mean, at minimum:

- The bytes 0x00-0x7F are reserved for ASCII characters.
- Byte-oriented find / index functions work correctly. You can't find a false positive by starting in the middle of a character like you can in Shift-JIS.
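
A rough upper bound can be computed under one explicit assumption that goes beyond the question: that, as in UTF-8, the 128 high bytes are split into disjoint sets of 2-byte lead bytes, 3-byte lead bytes and continuation bytes (this is one way to satisfy the "no false positives" requirement, not necessarily the only one). With `lead2`, `lead3` and `trail` bytes, the capacity is 128 + lead2 * trail + lead3 * trail^2, and the Python sketch below searches every split. The function name is made up for this example.

```python
UNICODE_CODE_POINTS = 0x110000  # 1,114,112


def best_three_byte_scheme(high_bytes=128):
    """Search all splits of the high bytes; return (capacity, lead2, lead3, trail)."""
    best = (0, 0, 0, 0)
    for trail in range(high_bytes + 1):
        for lead2 in range(high_bytes - trail + 1):
            lead3 = high_bytes - trail - lead2
            # 128 one-byte ASCII characters, plus 2-byte and 3-byte sequences.
            capacity = 128 + lead2 * trail + lead3 * trail ** 2
            best = max(best, (capacity, lead2, lead3, trail))
    return best


capacity, lead2, lead3, trail = best_three_byte_scheme()
print(capacity, lead2, lead3, trail)
print(capacity >= UNICODE_CODE_POINTS)
```

Under this assumption the best split falls well short of the full code space, so a scheme of this particular shape could not cover all code points in three bytes.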

How can I find all UTF-16 encoded text files in a directory tree with a unix command?

- by Jochen

Hi, I want to use a *nix shell command to find all UTF-16 encoded files (containing the UTF-16 Byte Order Mark) in a directory tree. Is there a command that I can use? Regards, Jochen
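
The question asks for a shell command; as an alternative that runs anywhere Python does, here is a small sketch that walks a directory tree and reports files starting with a UTF-16 byte order mark. The function name `find_utf16_files` is invented for the example. It treats a file beginning with FF FE 00 00 as UTF-32 rather than UTF-16, which is a judgment call: a UTF-16LE file whose first character is NUL would start with the same four bytes.

```python
import os

UTF16_BOMS = (b"\xff\xfe", b"\xfe\xff")  # little-endian, big-endian
UTF32_LE_BOM = b"\xff\xfe\x00\x00"       # shares its first two bytes with UTF-16LE


def find_utf16_files(root):
    """Return sorted paths under `root` whose content starts with a UTF-16 BOM."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    head = handle.read(4)
            except OSError:
                continue  # unreadable file: skip it
            if head.startswith(UTF16_BOMS) and not head.startswith(UTF32_LE_BOM):
                matches.append(path)
    return sorted(matches)


if __name__ == "__main__":
    for path in find_utf16_files("."):
        print(path)
```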

Fix stubborn 'Setting locale failed.'

- by plua

I have a very stubborn, well-known locale error on Ubuntu 9.10:

    perl: warning: Setting locale failed.
    perl: warning: Please check that your locale settings:
        LANGUAGE = (unset),
        LC_ALL = (unset),
        LC_TIME = "custom.UTF-8",
        LANG = "en_US.UTF-8"

Tried the following:

- Added LANG=en_US.UTF-8 and LC_ALL=en_US.UTF-8 to /etc/environment
- Ran apt-get install --reinstall locales (error: perl: warning: Falling back to the standard locale ("C"). /usr/bin/mandb: can't set the locale; make sure $LC_* and $LANG are correct)
- Ran sudo dpkg-reconfigure locales. Result: Cannot set LC_ALL to default locale: No such file or directory, and then it updates all locales, including en_US.UTF-8
- sudo locale-gen updates all locales successfully, including en_US.UTF-8
- sudo locale-gen un_US en_US.UTF-8 gives neither an error nor any other output
- /etc/default/locale says LANG="en_US.UTF-8"
- echo $LANG gives en_US.UTF-8
- /var/lib/locales/supported.d/local says en_US.UTF-8 UTF-8
- locale -a gives me: C en_AG en_AU.utf8 en_BW.utf8 en_CA.utf8 en_DK.utf8 en_GB.utf8 en_HK.utf8 en_IE.utf8 en_IN en_NG en_NZ.utf8 en_PH.utf8 en_SG.utf8 en_US.utf8 en_ZA.utf8 en_ZW.utf8 POSIX

So, well... I am pretty much out of options I can think of. Does anybody have any idea? Thanks!

Search Results

Search found 4604 results on 185 pages for 'utf 8'.
