Search Results

Search found 1474 results on 59 pages for 'unicode'.

  • "É" not getting converted to two bytes correctly.

    - by ChrisF
    Further to this question I've got a supplementary problem. I've found a track with an "É" in the title. My code: var playList = new StreamWriter(playlist, false, Encoding.UTF8); private static void WriteUTF8(StreamWriter playList, string output) { byte[] byteArray = Encoding.UTF8.GetBytes(output); foreach (byte b in byteArray) { playList.Write(Convert.ToChar(b)); } } converts this to the following bytes: 195 137, which is being output as Ã followed by a square (a character that can't be printed in the current font). I've exported the same file to a playlist in Media Monkey and it writes the "É" as "Ã‰" - which I'm assuming is correct (as KennyTM pointed out). My question is, how do I get the "‰" symbol output? Do I need to select a different font and if so which one? UPDATE: People seem to be missing the point. I can get the "É" written to the file using playList.WriteLine("É"); that's not the problem. The problem is that Media Monkey requires the file to be in the following format: #EXTINFUTF8:140,Yann Tiersen - Comptine D'Un Autre Été: L'Après Midi #EXTINF:140,Yann Tiersen - Comptine D'Un Autre Été: L'Après Midi #UTF8:04-Comptine D'Un Autre Été- L'Après Midi.mp3 04-Comptine D'Un Autre Été- L'Après Midi.mp3 Where all the "high-ascii" characters (for want of a better term) are written out as a pair of characters.
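
    Byte-level illustration (a Python sketch, not the poster's C#; the cp1252 decode is only there to show what those two bytes look like as single-byte characters):

        # "É" is two bytes in UTF-8; read as Windows-1252 those bytes display as "Ã‰"
        data = "É".encode("utf-8")
        print(list(data))             # [195, 137]
        print(data.decode("cp1252"))  # Ã‰ -- the two-character pair Media Monkey writes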

    Read the article

  • java: can I convert strings to byte arrays, without a BOM?

    - by Cheeso
    Suppose I have this code: String encoding = "UTF-16"; String text = "[Hello StackOverflow]"; byte[] message= text.getBytes(encoding); If I display the byte array in message, the result is: 0000 FE FF 00 5B 00 48 00 65 00 6C 00 6C 00 6F 00 20 ...[.H.e.l.l.o. 0010 00 53 00 74 00 61 00 63 00 6B 00 4F 00 76 00 65 .S.t.a.c.k.O.v.e 0020 00 72 00 66 00 6C 00 6F 00 77 00 5D .r.f.l.o.w.] As you can see, there's a BOM in the beginning. How can I: generate a UTF-16 byte array that lacks a BOM, from a string? convert from a byte array that contains UTF-16 chars but lacks a BOM, back to a string?
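
    The same BOM behaviour can be sketched quickly in Python (concept only, not the Java API): the generic encoding name writes a BOM, the endianness-specific names do not. Java behaves analogously: getBytes("UTF-16") prepends a big-endian BOM, while "UTF-16LE" and "UTF-16BE" write none, and new String(bytes, "UTF-16LE") decodes BOM-less data back to a string.

        # Generic name adds a BOM; endianness-specific names do not
        text = "[Hello StackOverflow]"
        print(text.encode("utf-16")[:4])     # b'\xff\xfe[\x00' -- BOM first (platform byte order, LE shown)
        print(text.encode("utf-16-be")[:4])  # b'\x00[\x00H'    -- no BOM, big-endian like the dump above
        print(text.encode("utf-16-be").decode("utf-16-be"))  # round-trips without a BOM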

    Read the article

  • Code to strip diacritical marks using ICU

    - by Paul J. Lucas
    Can somebody please provide some sample code to strip diacritical marks (i.e., replace characters having accents, umlauts, etc., with their unaccented, unumlauted, etc., character equivalents, e.g., every accented é would become a plain ASCII e) from a UnicodeString using the ICU library in C++? E.g.: UnicodeString strip_diacritics( UnicodeString const &s ) { UnicodeString result; // ... return result; } Assume that s has already been normalized. Thanks.
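
    Not the requested ICU/C++ code, but the algorithm itself (canonically decompose, drop the combining marks) is easy to sketch in Python with unicodedata; ICU exposes the same steps through its normalization and transliteration APIs:

        # Sketch of the algorithm: NFD-decompose, then drop nonspacing marks (category Mn)
        import unicodedata
        def strip_diacritics(s):
            decomposed = unicodedata.normalize("NFD", s)
            return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
        print(strip_diacritics("José, déjà vu"))  # -> Jose, deja vu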

    Read the article

  • python unichr problem

    - by jacob
    I've got a problem with unichr() on my server. Please see below: On my server (Ubuntu 9.04): >>> print unichr(255) Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 0: ordinal not in range(128) On my desktop (Ubuntu 9.10): >>> print unichr(255) ÿ I'm fairly new to Python so I don't know how to solve this. Anyone care to help? Thanks.
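
    The error comes from print trying to encode the Unicode string with the terminal's encoding (ASCII on the server, most likely because of the LANG/locale setting), not from unichr itself. A Python 2 sketch of the usual workaround, encoding explicitly:

        # Encode explicitly instead of relying on the terminal's default encoding
        import sys
        ch = unichr(255)  # u'\xff'
        print ch.encode(sys.stdout.encoding or "utf-8", "replace")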

    Read the article

  • to escape or not to escape: well formed XHTML with diacritics

    - by andresmh
    Say that you have an XHTML document in English but it has accented characters (e.g. meta name="author" content="José"). Let's say you have no control over the HTTP headers. Should the characters be replaced with their corresponding named entities (e.g. &aacute;, etc)? Should the doc type and the xml:lang attribute be set to English? I know I can check the W3C recommendation but I am asking more from a practical point of view.

    Read the article

  • Ruby character encoding problems in netbeans and command window

    - by salgo60
    I use NetBeans as my development IDE and run the application from cmd, but I have problems displaying ISO 8859-1 characters like åäö correctly, both in the cmd window and when I run the application from NetBeans. Question: what is the best practice for setting this up? Right now I do @output.puts indent + "V" + 132.chr + "lkommen till Ruby Camping!" to get "ä". My environment: chcp 65001 Active code page: 65001 ruby main.rb Source encoding: #<Encoding:US-ASCII> Default external: #<Encoding:UTF-8> Default internal: nil Locale charmap: "CP65001", where I have in the code def self.printEncoding puts "Source encoding: #{__ENCODING__.inspect}" if defined? __ENCODING__ if defined? Environment::Encoding puts "Default external: #{Encoding.default_external.inspect}" puts "Default internal: #{Encoding.default_internal.inspect}" puts "Locale charmap: #{ Encoding.locale_charmap.inspect}" end puts "LANG environment variable: #{ENV['LANG'].inspect}" unless ENV['LANG'].nil? end ruby -v ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-mingw32]
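
    A side note on the 132.chr workaround: 132 (0x84) is "ä" in the old OEM console code pages (CP437/CP850), which is why it displays correctly in cmd but not under chcp 65001, where UTF-8 bytes are expected. A quick Python check of the byte values (illustrative only; the question itself is about Ruby):

        # 0x84 is "ä" only in the OEM console code pages, not in UTF-8
        print(bytes([132]).decode("cp850"))  # ä
        print("ä".encode("utf-8"))           # b'\xc3\xa4' -- what an active code page of 65001 expects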

    Read the article

  • ajax(search suggest) funny character problem

    - by Jason
    In my AJAX search suggest, if I input a funny character (like Ô) to search for, "?" is displayed in Firefox and an empty box is displayed in IE. I am using xmlhttp.open("post", "*****.asp", true); xmlhttp.setRequestHeader('Content-type','application/x-www-form-urlencoded; charset=UTF-8'); and there is <%@CODEPAGE=65001%> in the *****.asp file. How can I fix it?
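
    One way to narrow this down (sketched in Python since the server side is classic ASP): check what a correctly UTF-8 form-urlencoded value should look like on the wire, then compare with what the browser actually sends (e.g. via a proxy) to see whether the client or the ASP page is mangling it.

        # What "Ô" should look like when form-urlencoded as UTF-8 vs. as cp1252
        from urllib.parse import quote
        print(quote("Ô"))                     # %C3%94 -- UTF-8, matching charset=UTF-8
        print(quote("Ô", encoding="cp1252"))  # %D4    -- what a Latin-1/cp1252 page would send instead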

    Read the article

  • Why isn't wchar_t widely used in code for Linux / related platforms?

    - by Ninefingers
    This intrigues me, so I'm going to ask: why is wchar_t not used as widely on Linux and Linux-like systems as it is on Windows? Specifically, the Windows API uses wchar_t internally, whereas I believe Linux does not, and this is reflected in a number of open source packages using char types. My understanding is that, given a character c which requires multiple bytes to represent it, in a char[] c is split across several elements of the array, whereas it forms a single unit in a wchar_t[]. Is it not easier, then, to always use wchar_t? Have I missed a technical reason that negates this difference? Or is it just an adoption problem?

    Read the article

  • How to generate pdf files _with_ utf-8 multibyte characters using Zend Framework

    - by Sejanus
    Hello, I've got a "little" problem with the Zend Framework Zend_Pdf class. Multibyte characters are stripped from generated PDF files. E.g. when I write aabccdee it becomes abcd, with the Lithuanian letters stripped. I'm not sure if it's a Zend_Pdf problem in particular or PHP in general. The source text is encoded in UTF-8, as is the PHP source file that does the job. Thank you in advance for your help ;) P.S. I run Zend Framework v. 1.6 and I use the FONT_TIMES_BOLD font. FONT_TIMES_ROMAN does work

    Read the article

  • Space-saving character encoding for japanese?

    - by Constantin
    In my opinion this is a common problem: character encoding in combination with a bitmap font. Most multi-language encodings have huge gaps between different character ranges and a lot of unused code points. So if I use them directly I waste a lot of memory (not only for storing multi-byte text - I mean especially the gaps in my bitmap font) - and VRAM is really valuable... So the only reasonable approach seems to be using a custom mapping on my texture for, e.g., UTF-8 characters (so that no space is wasted). BUT: that effort seems to be the same as using my own proprietary character encoding (and therefore my own order of characters in the texture). In my specific case I have texture space for 4096 different characters and need to display Latin languages as well as Japanese (it's a mess with UTF-8, which only supports the general CJK code pages). Has anybody had a similar problem (I'd be surprised if not)? Is there already an approach for this? Edit: the same problem is described at http://www.tonypottier.info/Unicode_And_Japanese_Kanji/ but it doesn't provide a real solution for storing these bitmap-font mappings space-efficiently. So any further help is welcome!
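
    A minimal sketch of the "custom mapping" idea in Python (hypothetical character set, not the poster's engine): the text can stay UTF-8 on disk and in memory, and only the atlas lookup table is proprietary, so no custom text encoding is needed.

        # Build a dense code point -> glyph-index table so a 4096-slot atlas has no gaps
        required = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 åäöÅÄÖ"
        required += "".join(chr(c) for c in range(0x3041, 0x3097))  # Hiragana range, purely illustrative
        glyph_index = {ch: i for i, ch in enumerate(sorted(set(required)))}
        assert len(glyph_index) <= 4096
        def atlas_slot(ch):
            return glyph_index[ch]  # strings stay UTF-8; only rendering consults this table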

    Read the article

  • Convert UTF-16 to UTF-8 under Windows and Linux, in C

    - by DooriBar
    I was wondering if there is a recommended 'cross' Windows and Linux method for converting strings from UTF-16LE to UTF-8? Or should one use different methods for each environment? I've managed to google a few references to 'iconv', but for some reason I can't find samples of basic conversions, such as converting a wchar_t UTF-16 string to UTF-8. Can anybody recommend a method that would be 'cross', and if you know of references or a guide with samples, I would very much appreciate it. Thanks, Doori Bar

    Read the article

  • tchar safe functions -- count parameter for UTF-8 constants

    - by Dustin Getz
    I'm porting a library from char to TCHAR. The count parameter of this fragment, according to MSDN, is the number of multibyte characters, not the number of bytes. So, did I get this right? _tcsncmp(access, TEXT("ftp"), 3); // or do I want _tcsnccmp? "Supported on Windows platforms only, _mbsncmp and _mbsnbcmp are multibyte versions of strncmp. _mbsncmp will compare at most count multibyte characters and _mbsnbcmp will compare at most count bytes. They both use the current multibyte code page. _tcsnccmp and _tcsncmp are the corresponding Generic functions for _mbsncmp and _mbsnbcmp, respectively. _tccmp is equivalent to _tcsnccmp."

    Read the article

  • Most Lite-Weight XML Parser with XPath and Wide-char Support

    - by Mystagogue
    I want a lite-weight C++ XML parser/DOM that: Can take UTF-8 as input, and parse into UTF-16. Maybe it does this directly (ideal!), or perhaps it provides a hook for the conversion (such as taking a custom stream object that does the conversion before parsing). Offers some XPath support. I've been looking at RapidXML, the Kranf xmlParser, and pugiXML. The first two of those might permit requirement #1 by way of a hook. The third, pugiXML, supports the #2 requirement. But none of those three fulfill both requirements. What is the smallest (free) library that can handle both requirements?

    Read the article

  • Weird SQL Server 2005 Collation difference between varchar() and nvarchar()

    - by richardtallent
    Can someone please explain this: SELECT CASE WHEN CAST('iX' AS nvarchar(20)) > CAST('-X' AS nvarchar(20)) THEN 1 ELSE 0 END, CASE WHEN CAST('iX' AS varchar(20)) > CAST('-X' AS varchar(20)) THEN 1 ELSE 0 END Results: 0 1 SELECT CASE WHEN CAST('i' AS nvarchar(20)) > CAST('-' AS nvarchar(20)) THEN 1 ELSE 0 END, CASE WHEN CAST('i' AS varchar(20)) > CAST('-' AS varchar(20)) THEN 1 ELSE 0 END Results: 1 1 On the first query, the nvarchar() result is not what I'm expecting, and yet removing the X makes the nvarchar() sort happen as expected. (My original queries used the '' and N'' literal syntax to distinguish varchar() and nvarchar() rather than CAST() and got the same result.) Collation setting for the database is SQL_Latin1_General_CP1_CI_AS.

    Read the article

  • Culture Sensitive GetHashCode

    - by user114928
    Hi, I'm writing a C# application that will process some text and provide basic query functions. In order to ensure the best possible support for other languages, I am allowing the users of the application to specify the System.Globalization.CultureInfo (via the "en-GB" style code) and also the full range of collation options using the System.Globalization.CompareOptions flags enum. For regular string comparison I'm then using a combination of: a) the String.Compare overload that accepts the culture and options; b) for some bulk processes, caching the byte data (KeyData) from CompareInfo.GetSortKey (the overload that accepts the options) and using a byte-by-byte comparison of the KeyData. This seemed fine (although please comment if you think these two methods shouldn't be mixed), but then I had reason to use the HashSet<T> class, which only has an overload for IEqualityComparer<T>. MS documentation seems to suggest that I should use StringComparer (which implements both IEqualityComparer<string> and IComparer<string>), but this only seems to support the "IgnoreCase" option from CompareOptions and not "IgnoreKanaType", "IgnoreSymbols", "IgnoreWidth" etc. I'm assuming that a StringComparer that ignores these other options could produce different hash codes for two strings that might be considered the same using my other comparison options. I'd therefore get incorrect results from my application. My only thought at the moment is to create my own IEqualityComparer<string> that generates a hash code from the SortKey.KeyData and compares equality by using the String.Compare overload. Any suggestions?

    Read the article

  • Emailing HTML from within an iPhone app is stopping at special characters

    - by user141146
    Hi, I have an iPhone app that will let users email some pre-determined text as HTML. I'm having a problem in that if the text contains special characters (e.g., ampersand &, <), the NSString variable that I use for sending the body of the email gets truncated at the special character. I'm not sure how to fix this (I tried using the method stringByAddingPercentEscapesUsingEncoding… but this hasn't fixed the problem). Thoughts on what I'm doing wrong / how to fix it? Here is sample code showing what I'm trying to do. Thanks!!! - (void)send_an_email:(id)sender { NSString *subject_string = [NSString stringWithFormat:@"Summary of %@", commercial_name]; NSString *body_string = [NSString stringWithFormat:@"%@<br /><br />", [self.dl email_message]]; // email_message returns the body of text that should be shipped as html. If email_message contains special characters, the text truncates at the special character NSString *full_string = [NSString stringWithFormat:@"mailto:?to=&subject=%@&body=%@", [subject_string stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding], [body_string stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding]]; [[UIApplication sharedApplication] openURL:[[NSURL alloc] initWithString:full_string]]; }
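
    The usual explanation is that stringByAddingPercentEscapesUsingEncoding leaves URL-legal characters such as & untouched, so everything after the first & is read as a new mailto parameter rather than body text; if memory serves, the period-appropriate fix was CFURLCreateStringByAddingPercentEscapes with "&" passed in legalURLCharactersToBeEscaped. A concept sketch in Python (not the iOS API) of what a fully escaped body looks like:

        # '&', '<' and '/' must themselves be percent-encoded inside a mailto: body
        from urllib.parse import quote
        body = "Fish & Chips <br />"
        print(quote(body, safe=""))  # Fish%20%26%20Chips%20%3Cbr%20%2F%3E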

    Read the article

  • Using Markov models to convert all caps to mixed case and related problems

    - by hippietrail
    I've been thinking about using Markov techniques to restore missing information to natural language text: (1) restore mixed case to text in all caps; (2) restore accents / diacritics to languages which should have them but have been converted to plain ASCII; (3) convert rough phonetic transcriptions back into native alphabets. That seems to be in order of least difficult to most difficult. Basically the problem is resolving ambiguities based on context. I can use Wiktionary as a dictionary and Wikipedia as a corpus using n-grams and Markov chains to resolve the ambiguities. Am I on the right track? Are there already some services, libraries, or tools for this sort of thing? Examples: GEORGE LOST HIS SIM CARD IN THE BUSH - George lost his SIM card in the bush; tantot il rit a gorge deployee - tantôt il rit à gorge déployée
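
    As a rough illustration of the simplest (unigram) version of this idea, a toy Python sketch with made-up frequency counts; a Markov/n-gram version would condition the choice on neighbouring words instead of scoring each token in isolation:

        # Toy case restoration: pick the most frequent known form of each all-caps token
        counts = {"George": 120, "george": 2, "SIM": 75, "sim": 30, "bush": 90, "Bush": 40}
        def restore_case(token):
            candidates = [w for w in counts if w.lower() == token.lower()]
            return max(candidates, key=lambda w: counts[w]) if candidates else token.lower()
        print(" ".join(restore_case(t) for t in "GEORGE LOST HIS SIM CARD IN THE BUSH".split()))
        # -> George lost his SIM card in the bush (with these counts)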

    Read the article

  • reading non-english html pages with c#

    - by Gal Miller
    I am trying to find a string in Hebrew on a website. The reading code is attached. Afterwards I try to read the file using StreamReader but I can't match strings in other languages. What am I supposed to do? // used on each read operation byte[] buf = new byte[8192]; // prepare the web page we will be asking for HttpWebRequest request = (HttpWebRequest) WebRequest.Create("http://www.webPage.co.il"); // execute the request HttpWebResponse response = (HttpWebResponse) request.GetResponse(); // we will read data via the response stream Stream resStream = response.GetResponseStream(); string tempString = null; int count = 0; FileStream fileDump = new FileStream(@"c:\dump.txt", FileMode.Create); do { count = resStream.Read(buf, 0, buf.Length); fileDump.Write(buf, 0, buf.Length); } while (count > 0); // any more data to read? fileDump.Close();
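
    The matching usually fails because the raw bytes are never decoded with the page's charset; in C# that would mean constructing the StreamReader with the right Encoding, and the same idea is sketched here in Python (keeping the placeholder URL from the question):

        # Decode with the charset the server declares, then Hebrew substring matching works
        import urllib.request
        with urllib.request.urlopen("http://www.webPage.co.il") as resp:
            charset = resp.headers.get_content_charset() or "utf-8"
            text = resp.read().decode(charset, "replace")
        print("שלום" in text)  # search for a Hebrew word in properly decoded text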

    Read the article

  • Matching 'weird' characters in PHP regex

    - by Bill X
    I have some strings that need a-strippin': ÃœT: 9.996636,76.294363 Tons of long strings of location codes. A literal regex in PHP won't match them, i.e. $pattern = /ÃœT:/; echo preg_replace($pattern, "", $row['location']); won't match/strip anything. (To know it's working, /T:/ does strip the last bit of that string.) What's the encoding error going on here? Alternatively, I would accept a concise way to take out just the numbers.
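
    For the "just take out the numbers" fallback, a concept sketch in Python (the question is PHP, where the same character class works with preg_replace): strip everything that is not part of the coordinates, so the encoding of the leading junk no longer matters.

        # Keep only digits and coordinate punctuation, regardless of the mojibake prefix
        import re
        s = "ÃœT: 9.996636,76.294363"
        print(re.sub(r"[^0-9.,\-]", "", s))  # -> 9.996636,76.294363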

    Read the article
