Search Results

Search found 1649 results on 66 pages for 'unicode normalization'.

Page 26/66 | < Previous Page | 22 23 24 25 26 27 28 29 30 31 32 33  | Next Page >

  • String searching algorithm for Chinese characters.

    - by Jack Low
    There are Python code available for existing algorithms for normal string searching e.g. Boyer-Moore Algorithm. I am looking to use this on Chinese characters and it doesn't seem like the same implementation would work. What would I go about doing in order to make the algorithm work on Chinese characters? I am referring to this: http://en.literateprograms.org/Boyer-Moore_string_search_algorithm_(Python)#References

    Read the article

  • Python.expat can't parse XML file with bad symbols. How to go around?

    - by culebrón
    I'm trying to parse an XML file with expat, and here's the line where I get bad token exception: <tag k="name" v="???????????????????????????????????????????????????????????????????" /> xml.parsers.expat.ExpatError: not well-formed (invalid token): line 610127, column 37 The symbols in hex look like: \xd1? Seems like someone wrote this string (Russian alfabet) hitting backspace a few times. I set parser.returns_unicode = True, but this didn't help. The 1st line is <?xml version="1.0" encoding="UTF-8"?>. I work with a bz2 file. (bz2.BZ2File) How can I parse the file?

    Read the article

  • PHP: Cyrillic characters not displayed correctly

    - by user295502
    Recently I switched hosting from one provider to the other and I have problems displaying Cyrillic characters. The characters which are read from the database are displayed correctly, but characters which are hardcoded in the php file aren't (they are displayed as question marks). The files which contain the php source code are saved in utf-8 form. Help anybody?

    Read the article

  • A UnicodeDecodeError that occurs with json in python on Windows, but not Mac.

    - by ventolin
    On windows, I have the following problem: >>> string = "Don´t Forget To Breathe" >>> import json,os,codecs >>> f = codecs.open("C:\\temp.txt","w","UTF-8") >>> json.dump(string,f) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "C:\Python26\lib\json\__init__.py", line 180, in dump for chunk in iterable: File "C:\Python26\lib\json\encoder.py", line 294, in _iterencode yield encoder(o) UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data (Notice the non-ascii apostrophe in the string.) However, my friend, on his mac (also using python2.6), can run through this like a breeze: > string = "Don´t Forget To Breathe" > import json,os,codecs > f = codecs.open("/tmp/temp.txt","w","UTF-8") > json.dump(string,f) > f.close(); open('/tmp/temp.txt').read() '"Don\\u00b4t Forget To Breathe"' Why is this? I've also tried using UTF-16 and UTF-32 with json and codecs, but to no avail.

    Read the article

  • python read utf8 text file problem

    - by cpps
    I have a problem with python about reading and print utf8 text file. I have a test.txt in utf8 encoding without BOM. This file has two characters in it: ?? The first character "?" is Chinese and the second "?" is Japanese. Now, When I use Ulipad (a python editor) to run the following code to read the txt file, and print these two characters. import codecs infile = "C:\\test.txt" f = codecs.open(infile, "r", "utf-8") s = f.read() print(s) I got this error, "UnicodeEncodeError: 'cp950' codec can't encode character '\u58f0' in position 1: illegal multibyte sequence" I found it caused from the second character "?" . But when I use the same code to test in python default GUI IDLE, it works to print the two characters with no error. So, how can I fix the problem. My running environment is python 3.1 , windows xp traditional Chinese.

    Read the article

  • How can I use ToUnicode without breaking dead key support?

    - by Cypherjb
    A similar question has already been asked, so I'm not going to waste time re-explaining it, an existing discussion can be found here: http://stackoverflow.com/questions/1964614/toascii-tounicode-in-a-keyboard-hook-destroys-dead-keys The reason I'm posting a new question however is that I seem to have come across a 'solution', but I'm not quite sure how to implement it. This blog post seems to propose a solution to the problem of ToUnicode killing dead-key support: http://blogs.msdn.com/michkap/archive/2005/01/19/355870.aspx However I'm not sure how to implement the suggested solution. A push in the right direction would be greatly appreciated. To be clear, the part I'm referring to is this: "There are two ways to work around this: 1) You can keep calling ToUnicode with the same info until it is cleared out and then call it one more time to put the state back where it was if you had never typed anything, or 2) You can load all of the keyboard info ahead of time and then when they type information you can look up in your own info cache what the keystrokes mean, without having to call APIs later." I'm not quite sure how to do either of those things (keyboards and internationalization are far from my strong point), so any help would be greatly appreciated. Thanks

    Read the article

  • Amazon SQS invalid binary character in message body

    - by letronje
    I have a web app that sends messages to an Amazon SQS Queue. Amazon sqs lib throws a 'AmazonSQSException' since the message contained invalid binary character. The message is the referrer obtained from an incoming http request. This is what it looks like: http://ads.vrx.adbrite.com/adserver/display_iab_ads.php?sid=1220459&title_color=0000FF&text_color=000000&background_color=FFFFFF&border_color=CCCCCC&url_color=008000&newwin=0&zs=3330305f323530&width=300&height=250&url=http%3A%2F%2Funblockorkutproxy.com%2Fsearch.php%2FOi8vZG93%2FbmxvYWRz%2FLnppZGR1%2FLmNvbS9k%2Fb3dubG9h%2FZGZpbGUv%2FNTY5MTQ3%2FNi9NeUN1%2FdGVHaXJs%2FZnJpZW5k%2FWmFoaXJh%2FLndtdi5o%2FdG1s%2Fb0%2F^Fô}úÃ<99ë)j Looks like the characters in bold are the invalid characters. Is there an easy way to filter out characters characters that are not accepted by amazon ? Here are the characters allowed by amazon in message body. I am not sure what regex i should use to replace invalid characters by ''

    Read the article

  • Code to strip diacritical marks using ICU

    - by Paul J. Lucas
    Can somebody please provide some sample code to strip diacritical marks (i.e., replace characters having accents, umlauts, etc., with their unaccented, unumlauted, etc., character equivalents, e.g., every accented é would become a plain ASCII e) from a UnicodeString using the ICU library in C++? E.g.: UnicodeString strip_diacritics( UnicodeString const &s ) { UnicodeString result; // ... return result; } Assume that s has already been normalized. Thanks.

    Read the article

  • How do I create JavaScript escape sequences in PHP?

    - by ordinarytoucan
    I'm looking for a way to create valid UTF-16 JavaScript escape sequence characters (including surrogate pairs) from within PHP. I'm using the code below to get the UTF-32 code points (from a UTF-8 encoded character). This works as JavaScript escape characters (eg. '\u00E1' for 'á') - until you get into the upper ranges where you get surrogate pairs (eg '??' comes out as '\u1D715' but should be '\uD835\uDF15')... function toOrdinal($chr) { if (ord($chr{0}) >= 0 && ord($chr{0}) <= 127) { return ord($chr{0}); } elseif (ord($chr{0}) >= 192 && ord($chr{0}) <= 223) { return (ord($chr{0}) - 192) * 64 + (ord($chr{1}) - 128); } elseif (ord($chr{0}) >= 224 && ord($chr{0}) <= 239) { return (ord($chr{0}) - 224) * 4096 + (ord($chr{1}) - 128) * 64 + (ord($chr{2}) - 128); } elseif (ord($chr{0}) >= 240 && ord($chr{0}) <= 247) { return (ord($chr{0}) - 240) * 262144 + (ord($chr{1}) - 128) * 4096 + (ord($chr{2}) - 128) * 64 + (ord($chr{3}) - 128); } elseif (ord($chr{0}) >= 248 && ord($chr{0}) <= 251) { return (ord($chr{0}) - 248) * 16777216 + (ord($chr{1}) - 128) * 262144 + (ord($chr{2}) - 128) * 4096 + (ord($chr{3}) - 128) * 64 + (ord($chr{4}) - 128); } elseif (ord($chr{0}) >= 252 && ord($chr{0}) <= 253) { return (ord($chr{0}) - 252) * 1073741824 + (ord($chr{1}) - 128) * 16777216 + (ord($chr{2}) - 128) * 262144 + (ord($chr{3}) - 128) * 4096 + (ord($chr{4}) - 128) * 64 + (ord($chr{5}) - 128); } } How do I adapt this code to give me proper UTF-16 code points? Thanks!

    Read the article

  • "É" not getting converted to two bytes correctly.

    - by ChrisF
    Further to this question I've got a supplementary problem. I've found a track with an "É" in the title. My code: var playList = new StreamWriter(playlist, false, Encoding.UTF8); - private static void WriteUTF8(StreamWriter playList, string output) { byte[] byteArray = Encoding.UTF8.GetBytes(output); foreach (byte b in byteArray) { playList.Write(Convert.ToChar(b)); } } converts this to the following bytes: 195 137 which is being output as à followed by a square (which is an character that can't be printed in the current font). I've exported the same file to a playlist in Media Monkey at it writes the "É" as "É" - which I'm assuming is correct (as KennyTM pointed out). My question is, how do I get the "‰" symbol output? Do I need to select a different font and if so which one? UPDATE People seem to be missing the point. I can get the "É" written to the file using playList.WriteLine("É"); that's not the problem. The problem is that Media Monkey requires the file to be in the following format: #EXTINFUTF8:140,Yann Tiersen - Comptine D'Un Autre Été: L'Après Midi #EXTINF:140,Yann Tiersen - Comptine D'Un Autre Été: L'Après Midi #UTF8:04-Comptine D'Un Autre Été- L'Après Midi.mp3 04-Comptine D'Un Autre Été- L'Après Midi.mp3 Where all the "high-ascii" (for want of a better term) are written out as a pair of characters.

    Read the article

  • java: can I convert strings to byte arrays, without a BOM?

    - by Cheeso
    Suppose I have this code: String encoding = "UTF-16"; String text = "[Hello StackOverflow]"; byte[] message= text.getBytes(encoding); If I display the byte array in message, the result is: 0000 FE FF 00 5B 00 48 00 65 00 6C 00 6C 00 6F 00 20 ...[.H.e.l.l.o. 0010 00 53 00 74 00 61 00 63 00 6B 00 4F 00 76 00 65 .S.t.a.c.k.O.v.e 0020 00 72 00 66 00 6C 00 6F 00 77 00 5D .r.f.l.o.w.] As you can see, there's a BOM in the beginning. How can I: generate a UTF-16 byte array that lacks a BOM, from a string? convert from a byte array that contains UTF-16 chars but lacks a BOM, back to a string?

    Read the article

  • python unichr problem

    - by jacob
    I've got some problem with unichr() on my server. Please see below: On my server (Ubuntu 9.04): >>> print unichr(255) Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 0: ordinal not in range(128) On my desktop (Ubuntu 9.10): >>> print unichr(255) ÿ I'm fairly new to python so I don't know how to solve this. Anyone care to help? Thanks.

    Read the article

  • to escape or not to escape: well formed XHTML with diacritics

    - by andresmh
    Say that you have a XHTML document in English but it has accented characters (e.g. meta name="author" content="José"). Let's say you have no control over the HTTP headers. Should the characters be replaced for their corresponding named entities (e.g. &aacute;, etc)? Should the doc type and the xml:lang attribute be set to English? I know I can check the W3C recommendation but I am asking more from a practical point of view.

    Read the article

  • Why isn't wchar_t widely used in code for Linux / related platforms?

    - by Ninefingers
    This intrigues me, so I'm going to ask - for what reason is wchar_t not used so widely on Linux/Linux-like systems as it is on Windows? Specifically, the Windows API uses wchar_t internally whereas I believe Linux does not and this is reflected in a number of open source packages using char types. My understanding is that given a character c which requires multiple bytes to represent it, then in a char[] form c is split over several parts of char* whereas it forms a single unit in wchar_t[]. Is it not easier, then, to use wchar_t always? Have I missed a technical reason that negates this difference? Or is it just an adoption problem?

    Read the article

  • How to generate pdf files _with_ utf-8 multibyte characters using Zend Framework

    - by Sejanus
    Hello, I've got a "little" problem with Zend Framework Zend_Pdf class. Multibyte characters are stripped from generated pdf files. E.g. when I write aabccdee it becomes abcd with lithuanian letters stripped. I'm not sure if it's particularly Zend_Pdf problem or php in general. Source text is encoded in utf-8, as well as the php source file which does the job. Thank you in advance for your help ;) P.S. I run Zend Framework v. 1.6 and I use FONT_TIMES_BOLD font. FONT_TIMES_ROMAN does work

    Read the article

  • Ruby character encoding problems in netbeans and command wíndow

    - by salgo60
    I use netbeans as development IDE and runs the application from cmd but have problems to display ISO 8859-1 characters like åäö correct in both cmd window and when I run the application from netbeans Question: What is best practice to set it up Right now I do @output.puts indent + "V" + 132.chr + "lkommen till Ruby Camping!" to get ä My environment chcp 65001 Active code page: 65001 ruby main.rb Source encoding: <Encoding:US-ASCII> Default external: #<Encoding:UTF-8> Default internal: nil Locale charmap: "CP65001" where I have in the code def self.printEncoding puts "Source encoding: #{__ENCODING__.inspect}" if defined? __ENCODING__ if defined? Environment::Encoding puts "Default external: #{Encoding.default_external.inspect}" puts "Default internal: #{Encoding.default_internal.inspect}" puts "Locale charmap: #{ Encoding.locale_charmap.inspect}" end puts "LANG environment variable: #{ENV['LANG'].inspect}" unless ENV['LANG'].nil? end ruby -v ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-mingw32]

    Read the article

  • Space-saving character encoding for japanese?

    - by Constantin
    In my opinion a common problem: character encoding in combination with a bitmap-font. Most multi-language encodings have an huge space between different character types and even a lot of unused code points there. So if I want to use them I waste a lot of memory (not only for saving multi-byte text - i mean specially for spaces in my bitmap-font) - and VRAM is mostly really valuable... So the only reasonable thing seems to be: Using an custom mapping on my texture for i.e. UTF-8 characters (so that no space is waste). BUT: This effort seems to be same with use an own proprietary character encoding (so also own order of characters in my texture). In my specially case I got texture space for 4096 different characters and need characters to display latin languages as well as japanese (its a mess with utf-8 that only support generall cjk codepages). Had somebody ever a similiar problem (I really wonder, if not)? If theres already any approach? Edit: The same Problem is described here http://www.tonypottier.info/Unicode_And_Japanese_Kanji/ but it doesnt provide an real solution how to save these bitmapfont mappings to utf-8 space efficent. So any further help is welcome!

    Read the article

  • ajax(search suggest) funny character problem

    - by Jason
    ajax(search suggest), if input funny character(like Ô) to search it, "?" is displayed in firefox or empty box is displayed in IE. i am using xmlhttp.open("post", "*****.asp", true); xmlhttp.setRequestHeader('Content-type','application/x-www-form-urlencoded; charset=UTF-8'); and there is <%@CODEPAGE=65001%> in *****.asp file how can i fix it?

    Read the article

  • Ruby character encoding problems in netabenas and command wíndow

    - by salgo60
    I use netbeans as development IDE and runs the application from cmd but have problems to display ISO 8859-1 characters like åäö correct in both cmd window and when I run the application from netbeans Question: What is best practice to set it up Right now I do @output.puts indent + "V" + 132.chr + "lkommen till Ruby Camping!" to get ä My environment chcp 65001 Active code page: 65001 ruby main.rb Source encoding: <Encoding:US-ASCII> Default external: #<Encoding:UTF-8> Default internal: nil Locale charmap: "CP65001" where I have in the code def self.printEncoding puts "Source encoding: #{__ENCODING__.inspect}" if defined? __ENCODING__ if defined? Environment::Encoding puts "Default external: #{Encoding.default_external.inspect}" puts "Default internal: #{Encoding.default_internal.inspect}" puts "Locale charmap: #{ Encoding.locale_charmap.inspect}" end puts "LANG environment variable: #{ENV['LANG'].inspect}" unless ENV['LANG'].nil? end ruby -v ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-mingw32]

    Read the article

  • tchar safe functions -- count parameter for UTF-8 constants

    - by Dustin Getz
    I'm porting a library from char to TCHAR. the count parameter of this fragment, according to MSDN, is the number of multibyte characters, not the number of bytes. so, did I get this right? _tcsncmp(access, TEXT("ftp"), 3); //or do i want _tcsnccmp? "Supported on Windows platforms only, _mbsncmp and _mbsnbcmp are multibyte versions of strncmp. _mbsncmp will compare at most count multibyte characters and _mbsnbcmp will compare at most count bytes. They both use the current multibyte code page. _tcsnccmp and _tcsncmp are the corresponding Generic functions for _mbsncmp and _mbsnbcmp, respectively. _tccmp is equivalent to _tcsnccmp."

    Read the article

< Previous Page | 22 23 24 25 26 27 28 29 30 31 32 33  | Next Page >