Search Results

Search found 1649 results on 66 pages for 'unicode normalization'.


  • Unicode license

    - by Eric Grange
    The Unicode Terms of Use (http://www.unicode.org/copyright.html) state that any software that uses the Unicode data files (or a modification of them) should carry the Unicode license references. It seems to me that most Unicode libraries have functions to check whether a character is a digit, a letter, a symbol, etc., and so will contain a modification of the Unicode Data Files (usually in the form of tables). Does that mean the license applies, and that all applications using such Unicode libraries should carry the license? I've checked around, and it appears very little Unicode software carries the license, though arguably most of the software that didn't was from companies that are members of the Unicode Consortium (do they get a license exemption?). Some (e.g. Mozilla) are only "Liaison Members", and while their software does not carry the license (AFAICT), it obviously relies on data derived from those data files. Is Mozilla in breach of the license? Should we carry the license in all apps that include any form of advanced Unicode support (i.e. apps that are bound to rely on the Unicode data files)? Or is there some form of broad exemption, given that very little software out there carries the license?

  • How to display multiple unicode text files in non-unicode program

    - by Stan
    OS: WinXP. Say I have some files in Chinese and some files in Korean, and in the Windows 'Region and Language Options' I set the language for non-Unicode programs to Chinese. Is there any way to read a Korean text file in a text editor easily, without using Microsoft Word? I need an environment that handles multiple Unicode scripts easily: I need to read Chinese, Japanese, and Korean in text editors (UltraEdit, Notepad++) and in terminal clients like SecureCRT. Please advise, thanks.

  • How do I properly implement Unicode passwords?

    - by Sorin Sbarnea
    Adding support for Unicode passwords is an important feature that developers should not ignore. Still, supporting Unicode in passwords is tricky, because the same text can be encoded in different ways in Unicode, and you do not want that to prevent people from logging in. Let's say that you store the passwords as UTF-8. The question is then how you should normalize the Unicode data. You have to be sure that you will be able to compare it, and that the release of the next Unicode standard will not invalidate your password verification. Note: there are still some places where Unicode passwords will probably never be used, but this question is not about why or when to use Unicode passwords; it is about how to implement them properly.
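
    One possible approach to the comparison problem raised above, sketched in Python 3 with only the standard library: normalize the password to a single Unicode normalization form before encoding it as UTF-8 and hashing it. The helper names (prepare_password, hash_password), the choice of NFKC, the PBKDF2 parameters, and the fixed salt are all assumptions made for this illustration; they are not taken from the question, and NFC is an equally common choice.

```python
import hashlib
import hmac
import unicodedata


def prepare_password(password: str) -> bytes:
    """Hypothetical helper: normalize to NFKC, then encode as UTF-8."""
    return unicodedata.normalize("NFKC", password).encode("utf-8")


def hash_password(password: str, salt: bytes) -> bytes:
    """Hypothetical helper: PBKDF2-HMAC-SHA256 over the normalized bytes."""
    return hashlib.pbkdf2_hmac("sha256", prepare_password(password), salt, 100_000)


# The same visible text, typed as two different code point sequences.
composed = "caf\u00e9"      # 'é' as a single precomposed code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

salt = b"fixed-demo-salt"   # demo only; use a random per-user salt in practice

# Without normalization the two strings differ; after it the hashes match.
strings_differ = composed != decomposed
hashes_match = hmac.compare_digest(
    hash_password(composed, salt), hash_password(decomposed, salt)
)
print(strings_differ, hashes_match)
```

    The normalization form has to be applied identically when the password is set and when it is verified; changing it later would invalidate the stored hashes.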

  • Python and Unicode: How everything should be Unicode

    - by A A
    Forgive me if this is a long question. I have been programming in Python for around six months. I am self-taught, starting with the Python tutorial, then SO, and then just using Google for stuff. Here is the sad part: no one told me all strings should be Unicode. No, I am not lying or making this up, but where does the tutorial mention it? Most examples I see also just use byte strings instead of Unicode strings. I was just browsing and came across this question on SO, which says that every string in Python should be a Unicode string. This pretty much made me cry! I read that every string in Python 3.0 is Unicode by default, so my questions are for 2.x: Should I write print u'Some text' or just print 'Text'? If everything should be Unicode, does that mean that a tuple like t = ('First', 'Second') should be t = (u'First', u'Second')? I read that I can do from __future__ import unicode_literals and then every string literal will be a Unicode string, but should I do this inside a container also? When reading from or writing to a file, I should use the codecs module, right? Or should I just use the standard way of reading and writing, and encode or decode where required? If I get a string from, say, raw_input(), should I convert that to Unicode as well? What is the common approach to handling all of the above in 2.x? The from __future__ import unicode_literals statement? Sorry for being such a noob, but this changes what I have been doing for a long time, so clearly I am confused.
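
    A minimal sketch of one common convention for the situation described above: keep text as Unicode inside the program and convert only at the file boundary. It uses explicit u'' literals and io.open, which behave the same on Python 2.6+ and Python 3.3+; putting from __future__ import unicode_literals at the top of the module would be an alternative to the u prefixes. The file name and the sample strings are made up for the example.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Text literals are explicitly Unicode, including those inside a container.
t = (u'First', u'S\xe9cond')

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# io.open encodes on write and decodes on read, so the program itself
# only ever handles Unicode strings.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'\n'.join(t))

with io.open(path, 'r', encoding='utf-8') as f:
    lines = f.read().split(u'\n')

# On disk the accented character is stored as two UTF-8 bytes.
with io.open(path, 'rb') as f:
    raw = f.read()
```

    Input that arrives as bytes from elsewhere (for example raw_input() on Python 2) would be decoded once, with the encoding of its source, before being mixed with these strings.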

  • SQLite, python, unicode, and non-utf data

    - by Nathan Spears
    I started by trying to store strings in sqlite using python, and got the message:

        sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

    Ok, I switched to Unicode strings. Then I started getting the message:

        sqlite3.OperationalError: Could not decode to UTF-8 column 'tag_artist' with text 'Sigur Rós'

    when trying to retrieve data from the db. More research and I started encoding it in utf8, but then 'Sigur Rós' starts looking like 'Sigur RÃ³s'.

    Note: My console was set to display in 'latin_1', as @John Machin pointed out.

    What gives? After reading this, describing exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all. I didn't know much about unicode and utf before I started this process. I've learned quite a bit in the last couple of hours, but I'm still ignorant of whether there is a way to correctly convert 'ó' from latin-1 to utf-8 and not mangle it. If there isn't, why would sqlite 'highly recommend' I switch my application to unicode strings?

    I'm going to update this question with a summary and some example code of everything I've learned in the last 24 hours so that someone in my shoes can have an easy(er) guide. If the information I post is wrong or misleading in any way please tell me and I'll update, or one of you senior guys can update.

    Summary of answers

    Let me first state the goal as I understand it. The goal in processing various encodings, if you are trying to convert between them, is to understand what your source encoding is, then convert it to unicode using that source encoding, then convert it to your desired encoding. Unicode is a base and encodings are mappings of subsets of that base. utf_8 has room for every character in unicode, but because they aren't in the same place as in, for instance, latin_1, a string encoded in utf_8 and sent to a latin_1 console will not look the way you expect. In python the process of getting to unicode and into another encoding looks like:

        str.decode('source_encoding').encode('desired_encoding')

    or, if the str is already in unicode:

        str.encode('desired_encoding')

    For sqlite I didn't actually want to encode it again; I wanted to decode it and leave it in unicode format.

    Here are four things you might need to be aware of as you try to work with unicode and encodings in python:

    1. The encoding of the string you want to work with, and the encoding you want to get it to.
    2. The system encoding.
    3. The console encoding.
    4. The encoding of the source file.

    Elaboration:

    (1) When you read a string from a source, it must have some encoding, like latin_1 or utf_8. In my case, I'm getting strings from filenames, so unfortunately, I could be getting any kind of encoding. Windows XP uses UCS-2 (a Unicode system) as its native string type, which seems like cheating to me. Fortunately for me, the characters in most filenames are not going to be made up of more than one source encoding type, and I think all of mine were either completely latin_1, completely utf_8, or just plain ascii (which is a subset of both of those). So I just read them and decoded them as if they were still in latin_1 or utf_8. It's possible, though, that you could have latin_1 and utf_8 and whatever other characters mixed together in a filename on Windows. Sometimes those characters can show up as boxes, other times they just look mangled, and other times they look correct (accented characters and whatnot). Moving on.

    (2) Python has a default system encoding that gets set when python starts and can't be changed during runtime. See here for details. Dirty summary ... well, here's the file I added:

        # sitecustomize.py
        # this file can be anywhere in your Python path,
        # but it usually goes in ${pythondir}/lib/site-packages/
        import sys
        sys.setdefaultencoding('utf_8')

    This system encoding is the one that gets used when you use the unicode("str") function without any other encoding parameters. To say that another way, python tries to decode "str" to unicode based on the default system encoding.

    (3) If you're using IDLE or the command-line python, I think that your console will display according to the default system encoding. I am using pydev with eclipse for some reason, so I had to go into my project settings, edit the launch configuration properties of my test script, go to the Common tab, and change the console from latin-1 to utf-8 so that I could visually confirm what I was doing was working.

    (4) If you want to have some test strings, e.g. test_str = "ó", in your source code, then you will have to tell python what kind of encoding you are using in that file. (FYI: when I mistyped an encoding I had to ctrl-Z because my file became unreadable.) This is easily accomplished by putting a line like so at the top of your source code file:

        # -*- coding: utf_8 -*-

    If you don't have this information, python attempts to parse your code as ascii by default, and so:

        SyntaxError: Non-ASCII character '\xf3' in file _redacted_ on line 81, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

    Once your program is working correctly, or if you aren't using python's console or any other console to look at output, then you will probably really only care about #1 on the list. System default and console encoding are not that important unless you need to look at output and/or you are using the builtin unicode() function (without any encoding parameters) instead of the string.decode() function.

    I wrote a demo function I will paste into the bottom of this gigantic mess that I hope correctly demonstrates the items in my list. Here is some of the output when I run the character 'ó' through the demo function, showing how various methods react to the character as input. My system encoding and console output are both set to utf_8 for this run:

        '?' = original char <type 'str'> repr(char)='\xf3'
        '?' = unicode(char) ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
        'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
        '?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data

    Now I will change the system and console encoding to latin_1, and I get this output for the same input:

        'ó' = original char <type 'str'> repr(char)='\xf3'
        'ó' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
        'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
        '?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data

    Notice that the 'original' character displays correctly and the builtin unicode() function works now. Now I change my console output back to utf_8:

        '?' = original char <type 'str'> repr(char)='\xf3'
        '?' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
        '?' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
        '?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data

    Here everything still works the same as last time, but the console can't display the output correctly. Etc. The function below also displays more information than this and hopefully will help someone figure out where the gap in their understanding is. I know all this information is in other places and more thoroughly dealt with there, but I hope that this will be a good kickoff point for someone trying to get coding with python and/or sqlite. Ideas are great, but sometimes source code can save you a day or two of trying to figure out what functions do what.

    Disclaimers: I'm no encoding expert; I put this together to help my own understanding. I kept building on it when I should probably have started passing functions as arguments to avoid so much redundant code, so if I can I'll make it more concise. Also, utf_8 and latin_1 are by no means the only encoding schemes; they are just the two I was playing around with because I think they handle everything I need. Add your own encoding schemes to the demo function and test your own input.

    One more thing: there are apparently crazy application developers making life difficult in Windows.

        #!/usr/bin/env python
        # -*- coding: utf_8 -*-

        import os
        import sys

        def encodingDemo(str):
            # Python 2 code: str is a byte string here.
            validStrings = ()
            try:
                print "str =",str,"{0} repr(str) = {1}".format(type(str), repr(str))
                validStrings += ((str,""),)
            except UnicodeEncodeError as ude:
                print "Couldn't print the str itself because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
                print ude
            try:
                x = unicode(str)
                print "unicode(str) = ",x
                validStrings+= ((x, " decoded into unicode by the default system encoding"),)
            except UnicodeDecodeError as ude:
                print "ERROR. unicode(str) couldn't decode the string because the system encoding is set to an encoding that doesn't understand some character in the string."
                print "\tThe system encoding is set to {0}. See error:\n\t".format(sys.getdefaultencoding()),
                print ude
            except UnicodeEncodeError as uee:
                print "ERROR. Couldn't print the unicode(str) because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
                print uee
            try:
                x = str.decode('latin_1')
                print "str.decode('latin_1') =",x
                validStrings+= ((x, " decoded with latin_1 into unicode"),)
                try:
                    print "str.decode('latin_1').encode('utf_8') =",str.decode('latin_1').encode('utf_8')
                    validStrings+= ((x, " decoded with latin_1 into unicode and encoded into utf_8"),)
                except UnicodeEncodeError as uee:
                    # encode() failures raise UnicodeEncodeError, not UnicodeDecodeError
                    print "The string was decoded into unicode using the latin_1 encoding, but couldn't be encoded into utf_8. See error:\n\t",
                    print uee
            except UnicodeDecodeError as ude:
                print "Something didn't work, probably because the string wasn't latin_1 encoded. See error:\n\t",
                print ude
            except UnicodeEncodeError as uee:
                print "ERROR. Couldn't print the str.decode('latin_1') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
                print uee
            try:
                x = str.decode('utf_8')
                print "str.decode('utf_8') =",x
                validStrings+= ((x, " decoded with utf_8 into unicode"),)
                try:
                    print "str.decode('utf_8').encode('latin_1') =",str.decode('utf_8').encode('latin_1')
                    # record the string only when the round trip succeeded
                    validStrings+= ((x, " decoded with utf_8 into unicode and encoded into latin_1"),)
                except UnicodeEncodeError as uee:
                    print "str.decode('utf_8').encode('latin_1') didn't work. The string was decoded into unicode using the utf_8 encoding, but couldn't be encoded into latin_1. See error:\n\t",
                    print uee
            except UnicodeDecodeError as ude:
                print "str.decode('utf_8') didn't work, probably because the string wasn't utf_8 encoded. See error:\n\t",
                print ude
            except UnicodeEncodeError as uee:
                print "ERROR. Couldn't print the str.decode('utf_8') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",uee

            print
            print "Printing information about each character in the original string."
            for char in str:
                try:
                    print "\t'" + char + "' = original char {0} repr(char)={1}".format(type(char), repr(char))
                except UnicodeDecodeError as ude:
                    print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), ude)
                except UnicodeEncodeError as uee:
                    print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), uee)
                    print uee
                try:
                    x = unicode(char)
                    print "\t'" + x + "' = unicode(char) {1} repr(unicode(char))={2}".format(x, type(x), repr(x))
                except UnicodeDecodeError as ude:
                    print "\t'?' = unicode(char) ERROR: {0}".format(ude)
                except UnicodeEncodeError as uee:
                    print "\t'?' = unicode(char) {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
                try:
                    x = char.decode('latin_1')
                    print "\t'" + x + "' = char.decode('latin_1') {1} repr(char.decode('latin_1'))={2}".format(x, type(x), repr(x))
                except UnicodeDecodeError as ude:
                    print "\t'?' = char.decode('latin_1') ERROR: {0}".format(ude)
                except UnicodeEncodeError as uee:
                    print "\t'?' = char.decode('latin_1') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
                try:
                    x = char.decode('utf_8')
                    print "\t'" + x + "' = char.decode('utf_8') {1} repr(char.decode('utf_8'))={2}".format(x, type(x), repr(x))
                except UnicodeDecodeError as ude:
                    print "\t'?' = char.decode('utf_8') ERROR: {0}".format(ude)
                except UnicodeEncodeError as uee:
                    print "\t'?' = char.decode('utf_8') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
            print

        x = 'ó'
        encodingDemo(x)

    Much thanks for the answers below and especially to @John Machin for answering so thoroughly.
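
    For readers on Python 3, the decode-then-encode process described in the post can be sketched as follows. This is an illustration and not the author's code: in Python 3 the byte string type is bytes and the text type is str, and the helper name recode is made up for the example.

```python
# Bytes as they might arrive from a latin-1 source: 'ó' is the single byte 0xF3.
raw_latin1 = b"Sigur R\xf3s"

# Step 1: decode with the *source* encoding to get text.
text = raw_latin1.decode("latin_1")

# Step 2: encode the text with the *desired* encoding.
utf8_bytes = text.encode("utf_8")

# Viewing UTF-8 bytes through a latin-1 lens gives the mangled
# form seen on a console that is set to latin-1.
mojibake = utf8_bytes.decode("latin_1")

# Decoding with the wrong source encoding fails outright, as in the demo output.
try:
    b"\xf3".decode("utf_8")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True


def recode(data: bytes, source: str, target: str) -> bytes:
    """Hypothetical helper: convert bytes from one encoding to another."""
    return data.decode(source).encode(target)


print(text, utf8_bytes, mojibake, wrong_decode_failed)
```

    For sqlite the second step would be skipped, since the goal stated in the post is to keep the decoded text as Unicode.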

  • Unicode issue in Django

    - by Kave
    I seem to have a unicode problem with the deal_instance_name in the Deal model. It says:

        coercing to Unicode: need string or buffer, __proxy__ found

    The exception happens on this line:

        return smart_unicode(self.deal_type.deal_name) + _(u' - Set No.') + str(self.set)

    The line works if I remove smart_unicode(self.deal_type.deal_name), but why? Back then in Django 1.1 someone had the same problem on Stackoverflow. I have tried both unicode() and the new smart_unicode() without any joy. What could I be missing, please?

        class Deal(models.Model):
            def __init__(self, *args, **kwargs):
                super(Deal, self).__init__(*args, **kwargs)
                self.deal_instance_name = self.__unicode__()

            deal_type = models.ForeignKey(DealType)
            deal_instance_name = models.CharField(_(u'Deal Name'), max_length=100)
            set = models.IntegerField(_(u'Set Number'))

            def __unicode__(self):
                return smart_unicode(self.deal_type.deal_name) + _(u' - Set No.') + str(self.set)

            class Meta:
                verbose_name = _(u'Deal')
                verbose_name_plural = _(u'Deals')

    Dealtype:

        class DealType(models.Model):
            deal_name = models.CharField(_(u'Deal Name'), max_length=40)
            deal_description = models.TextField(_(u'Deal Description'), blank=True)

            def __unicode__(self):
                return smart_unicode(self.deal_name)

            class Meta:
                verbose_name = _(u'Deal Type')
                verbose_name_plural = _(u'Deal Types')

  • Is data integrity possible without normalization?

    - by shuniar
    I am working on an application that requires the storage of location information such as city, state, zip code, latitude, and longitude. I would like to ensure:

    - Location data is accurate
        Detroit, CA: Detroit IS NOT in California
        Detroit, MI: Detroit IS in Michigan
    - Cities and states are spelled correctly
        California, not Calefornia
        Detroit, not Detriot
    - Cities and states are named consistently
        Valid: CA, Detroit
        Invalid: Cali, california, DET, d-town, The D

    Also, since city/zip data is not guaranteed to be static, updating this data in a normalized fashion could be difficult, whereas it could be implemented as a de facto location if it is denormalized.

    A couple of thoughts that come to mind:

    - A collection of reference tables that store a list of all states and the most common cities and zip codes, and that can grow over time. It would search the database for an exact or similar match and recommend corrections.
    - Use some sort of service to validate the location data before it is stored in the database.

    Is it possible to fulfill these requirements without normalization, and if so, should I denormalize this data?
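
    The reference-table idea above (exact match first, then a similar match offered as a correction) can be sketched in Python with the standard difflib module. The in-memory dictionary stands in for the reference tables, and the names VALID_LOCATIONS and check_location, the sample cities, and the 0.6 similarity cutoff are assumptions made for this illustration only.

```python
import difflib

# Hypothetical reference data; a real application would load this
# from its reference tables.
VALID_LOCATIONS = {
    "MI": ["Detroit", "Dearborn", "Lansing"],
    "CA": ["Los Angeles", "Sacramento", "San Diego"],
}


def check_location(city, state):
    """Return (is_valid, suggestion) for a city/state pair.

    is_valid is True only for an exact match in the reference data;
    suggestion is the closest known city in that state, or None.
    """
    cities = VALID_LOCATIONS.get(state)
    if cities is None:
        return False, None
    if city in cities:
        return True, city
    close = difflib.get_close_matches(city, cities, n=1, cutoff=0.6)
    return False, (close[0] if close else None)


print(check_location("Detroit", "MI"))   # exact match
print(check_location("Detriot", "MI"))   # misspelling, correction suggested
print(check_location("Detroit", "CA"))   # valid city, wrong state
```

    Whether the stored rows then reference the table by key or copy the corrected text is the normalization decision the question asks about; the lookup works the same either way.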

    Read the article

  • On Windows 7, dir or tree can't show unicode characters, even starting cmd with cmd /U

    - by Jian Lin
    On Windows 7, dir or tree can't show Unicode characters, even when starting cmd with cmd /U. So I press Windows Key + R to run something and type in cmd /U so that the content might handle Unicode. But when using dir or tree /F, the Unicode content doesn't show as Unicode (in Windows Explorer, the file manager, the Unicode does show). Is there a way to handle it? To get Unicode characters to test your filenames, you can go to http://news.google.com/news?edchanged=1&ned=tw and you will be able to get many Unicode characters there (UTF-8)

    Read the article

  • Validate Unicode String and Escape if Unicode is Invalid (C/C++)

    - by vy32
    I have a program that reads arbitrary data from a file system and outputs results in Unicode. The problem I am having is that sometimes filenames are valid Unicode and sometimes they aren't. So I want a function that can validate a string (in C or C++) and tell me if it is a valid UTF-8 encoding. If it is not, I want to have the invalid characters escaped so that it will be a valid UTF-8 encoding. This is different than escaping for XML --- I need to do that also. But first I need to be sure that the Unicode is right. I've seen some code from which I could hack this, but I would rather use some working code if it exists.
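
    The question asks for C or C++, but the logic is easy to show in Python, so the following is only a sketch of the behaviour, not a drop-in answer. It checks whether a byte string is valid UTF-8 and, if not, replaces each offending byte with a `\xNN` escape using the standard `backslashreplace` error handler (available for decoding since Python 3.5). The function names are made up for this example.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def escape_invalid_utf8(raw: bytes) -> str:
    """Decode raw as UTF-8, turning each invalid byte into a \\xNN escape."""
    return raw.decode("utf-8", errors="backslashreplace")
```

    The result of `escape_invalid_utf8` is a normal string and can always be re-encoded as UTF-8. Note that a literal backslash in valid input is left untouched, so the escaping is not reversible unless backslashes are escaped as well.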

    Read the article

  • Delphi Unicode String Type Stored Directly at its Address (or "Unicode ShortString")

    - by Andreas Rejbrand
    I want a string type that is Unicode and that stores the string directly at the address of the variable, as is the case of the (Ansi-only) ShortString type. I mean, if I declare a S: ShortString and let S := 'My String', then, at @S, I will find the length of the string (as one byte, so the string cannot contain more than 255 characters) followed by the ANSI-encoded string itself. What I would like is a Unicode variant of this. That is, I want a string type such that, at @S, I will find an unsigned 32-bit integer (or a single byte would be enough, actually) containing the length of the string in bytes (or in characters, which is half the number of bytes) followed by the Unicode representation of the string. I have tried WideString, UnicodeString, and RawByteString, but they all appear only to store an address at @S, and the actual string somewhere else (I guess this has to do with reference counting and such). Update: The most important reason for this is probably that it would be very problematic if sizeof(string) were variable. I suspect that there is no built-in type to use, and that I have to come up with my own way of storing text the way I want (which actually is fun). Am I right? Update I will, among other things, need to use these strings in packed records. I also need manually to read/write these strings to files/the heap. I could live with fixed-size strings, such as <= 128 characters, and I could redesign the problem so it will work with null-terminated strings. But PChar will not work, for sizeof(PChar) = 1 - it's merely an address. The approach I eventually settled for was to use a static array of bytes. I will post my implementation as a solution later today.
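
    The layout described here (a length prefix followed by the text in a fixed-size buffer) can be illustrated outside Delphi. The sketch below uses Python's `struct` module and is not the asker's implementation; the 128-unit limit and the one-byte length prefix are assumptions taken from the sizes mentioned in the question.

```python
import struct

MAX_UNITS = 128  # assumed limit, counted in UTF-16 code units
# One length byte followed by a fixed buffer of 2 * MAX_UNITS bytes.
RECORD = struct.Struct("<B%ds" % (MAX_UNITS * 2))


def pack_short_utf16(text):
    """Pack text as a fixed-size record: length byte plus UTF-16LE data."""
    data = text.encode("utf-16-le")
    units = len(data) // 2
    if units > MAX_UNITS:
        raise ValueError("text is longer than %d UTF-16 code units" % MAX_UNITS)
    # The 's' format pads the buffer with zero bytes up to its fixed size.
    return RECORD.pack(units, data)


def unpack_short_utf16(buf):
    """Read the record back, using the length byte to ignore the padding."""
    units, data = RECORD.unpack(buf)
    return data[: units * 2].decode("utf-16-le")
```

    Every record has the same size (`RECORD.size`, 257 bytes here), which is what makes this kind of layout usable inside packed records and flat files.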

    Read the article

  • How to convert a unicode character array back to unicode sequence in C++

    - by eddyxd
    My problem is how to convert a C/C++ string (character array) into another string containing the Unicode (UTF-16) escape sequence of the original. For example, I want a function F(char *ch) that does the following: char a[10] = "\u5f53"; printf("a = %s\n",a); char b[10]; b = F(a); //<- F is the function I wanted printf("b = %s\n",b); -------- console will show ------- a = ? b = \u5f53 Does anyone have any idea? Thanks! PS: I guessed that \u5f35 is the value stored in a, but it is not; the actual values are a[0] = -79, a[1] = 105 ... So I don't know how to convert it back to the Unicode escape sequence. Please give me a hand. : )
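
    The conversion itself has two steps: decode the bytes with the encoding they were actually stored in, then print each UTF-16 code unit as `\uXXXX`. The sketch below shows the second step in Python rather than C++, purely as an illustration. The function name is made up, and the input is assumed to be already decoded text.

```python
def to_utf16_escapes(text):
    """Return text as a sequence of \\uXXXX escapes, one per UTF-16 code unit."""
    data = text.encode("utf-16-be")
    # Split the big-endian byte string into 16-bit code units.
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "".join("\\u%04x" % unit for unit in units)
```

    Characters outside the Basic Multilingual Plane come out as two escapes (a surrogate pair). In C or C++ the byte values seen in the array depend on the compiler's execution character set, so the first step (decoding) has to use that encoding.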

    Read the article

  • How to unicode Myanmar texts on Java? [closed]

    - by Spacez Ly Wang
    I'm just a beginner in Java. I'm trying to display Myanmar Unicode text correctly in a Java GUI (Swing/AWT). I have four TrueType fonts which support Myanmar Unicode text: Myanmar3, Padauk, Tharlon, and Myanmar Text (built into Windows 8). You may need the fonts before running the code. Google the fonts, please. Each of the fonts displays differently and incorrectly in the Java GUI. Here is the code for a GUI label displaying Myanmar text: ++++++++++++++++++++++++ package javaapplication1; import javax.swing.JFrame; import javax.swing.JLabel; public class CusFrom { private static void createAndShowGUI() { JFrame frame = new JFrame("Hello World Swing"); frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE); String s = "\u1015\u102F \u103C\u1015\u102F"; JLabel label = new JLabel(s); label.setFont(new java.awt.Font("Myanmar3", 0, 20));// font insert here, Myanmar Text, Padauk, Myanmar3, Tharlon frame.getContentPane().add(label); frame.pack(); frame.setVisible(true); } public static void main(String[] args) { javax.swing.SwingUtilities.invokeLater(new Runnable() { public void run() { createAndShowGUI(); } }); } } ++++++++++++++++++++++++ Outputs vary. See the pictures: Myanmar3 IMG Padauk IMG Tharlon IMG Myanmar Text IMG What is the correct form? 
(on notepad) Well, next is the code for GUI Textfield inputting Myanmar texts: ++++++++++++++++++++++++ package javaapplication1; import javax.swing.JFrame; import javax.swing.JTextField; public class XusForm { private static void createAndShowGUI() { JFrame frame = new JFrame("Frame Title"); frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE); JTextField textfield = new JTextField(); textfield.setFont(new java.awt.Font("Myanmar3", 0, 20)); frame.getContentPane().add(textfield); frame.pack(); frame.setVisible(true); } public static void main(String[] args) { javax.swing.SwingUtilities.invokeLater(new Runnable() { public void run() { createAndShowGUI(); } }); } } ++++++++++++++++++++++++ Outputs vary when I input keys( unicode text ) on keyboards. Myanmar Text Output IMG Padauk Output IMG Myanmar3 Output IMG Tharlon Output IMG Those fonts work well on Linux when opening text files with Text Editor application. My Question is how to unicode Myanmar texts on Java GUI. Do I need additional codes left to display well? Or Does Java still have errors? The fonts display well on Web Application (HTML, CSS) but I'm not sure about displaying on Java Web Application.

    Read the article

  • Unicode fonts render incorrectly in Terminal

    - by Sridher
    My Ubuntu 13.04 terminal renders Unicode Indic fonts incorrectly, but they are rendered correctly in Firefox, gedit, Chrome etc. How can I fix this? Works fine in: Firefox, Chrome, Gedit, Open Office Not working in: Terminal UPDATE : Here is the screenshot from my desktop showing the Telugu font rendering in various applications (including my sample pygame example) note: the pygame Unicode output and the console render incorrectly in the same way, but the rest of the apps render correctly

    Read the article

  • Unicode support between different OS and browsers

    - by Martin Trigaux
    I would like to develop a web application that uses Unicode. The problem is that I don't know whether the user's system supports the full Unicode set. First question: does Unicode support depend on the browser or on the operating system? Second question: how well do the main browsers and operating systems behave? The goal is to find large subsets of widely supported Unicode characters (I accept not supporting old technology). Thank you

    Read the article

  • Python unicode search not giving correct answer

    - by user1318912
    I am trying to search for Hindi words, stored one per line in file-1, within the lines of file-2. I have to print the line numbers along with the number of words found. This is the code: import codecs hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8").readlines() words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines() count_arr = [] for counter, line in enumerate(hypernyms): count_arr.append(0) for word in words: if line.find(word) >=0: count_arr[counter] +=1 for iterator, count in enumerate(count_arr): if count>0: print iterator, ' ', count This finds some words but ignores others. The input files are: File-1: ???? ??????? File-2: ???????, ????-???? ?????-???, ?????-???, ?????_???, ?????_??? ????_????, ????-????, ???????_???? ????-???? This gives output: 0 1 3 1 Clearly, it is ignoring ??????? and searching for ???? only. I have tried with other inputs as well. It only searches for one word. Any idea how to correct this?
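
    One likely cause is that `readlines()` keeps the trailing newline on every line. A word read as `word\n` can then only match when it sits at the very end of a line, while a last word without a newline matches anywhere. The sketch below is written for Python 3, strips each word before searching, and also applies NFC normalization as a precaution in case the two files encode the same letters differently. Whether normalization is needed for these particular files is an assumption; the function name and sample data in the usage note are made up.

```python
import unicodedata


def count_matches(lines, words):
    """Return (line_number, hit_count) pairs for lines containing any word."""
    cleaned = []
    for word in words:
        word = unicodedata.normalize("NFC", word.strip())
        if word:  # skip blank lines in the word list
            cleaned.append(word)

    results = []
    for number, line in enumerate(lines):
        line = unicodedata.normalize("NFC", line)
        hits = sum(1 for word in cleaned if word in line)
        if hits:
            results.append((number, hits))
    return results
```

    For example, `count_matches(["alpha, beta-gamma\n", "delta\n"], ["alpha\n", "gamma\n"])` reports two hits on line 0, whereas searching for the unstripped `"alpha\n"` would find nothing in that line.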

    Read the article

  • Why is Cocoa using an old version of Unicode? [closed]

    - by Randy Marsh
    While I was searching for something in the Apple docs, I stumbled on this: illegalCharacterSet Returns a character set containing values in the category of Non-Characters or that have not yet been defined in version 3.2 of the Unicode standard. On the Unicode website, I find that the latest version is 6.1.0. That's several major versions higher than what Cocoa supports. Does somebody know why Apple doesn't upgrade their framework? My more important question is: Are there problems with not having support for Unicode versions after 3.2? Will I have problems reading Unicode files created on other systems with a more recent version of Unicode?

    Read the article

  • Does semi-normalization exist as a concept? Is it "normalized"?

    - by Gracchus
    If you don't mind, a tldr on my experience: My experience tldr I have an application that's heavily dependent upon uncertainty, a bane to database design. I tried to normalize it as best as I could according to the capabilities of my database of choice, but a "simple" query took 50ms to read. Nosql appeals to me, but I can't trust myself with it, and besides, normalization has cut down my debugging time immensely over and over. Instead of 100% normalization, I made semi-redundant 1:1 tables with very wide primary keys and equivalent foreign keys. Read times dropped to a few ms, and write times barely degraded. The semi-normalized point Given this reality, that anyone who's tried to rely upon views of fully normalized data is aware of, is this concept codified? Is it as simple as having wide unique and foreign keys, or are there any hidden secrets to this technique? Or is uncertainty merely a special case that has extremely limited application and can be left on the ash heap?

    Read the article

  • Normalization in plain English

    - by Yada
    I sort of understand the concept of database normalization but always have a hard time explaining it in plain English, especially in a job interview. I have read the Wikipedia article, but still find it hard to explain the concept to non-developers. "Design a database in a way not to get duplicated data" is the first thing that comes to mind. Does anyone have a nice way to explain the concept of database normalization in plain English? And what are some nice examples to show the differences between first, second and third normal forms? Say you go to a job interview and the person asks: Explain the concept of normalization and how you would go about designing a normalized database. What key points is the interviewer looking for?

    Read the article

  • How to convert image into unicode

    - by Zahida Raeesi
    Hello there: I have created a Baluchi keyboard based on the Arabic keyboard, but a few keys are not available in Arabic either. I tried different key combinations to meet my requirement, but the issue now is that for one specific key there is no code point available in the Unicode chart. Please help me convert this image into proper Unicode text so that I can update my Baluchi keyboard. Looking forward to your prompt and positive response. With best regards, Raji Baloch

    Read the article

  • Perl latin-9? Unicode - need to add support

    - by Phill Pafford
    I have an application that is being expanded to the UK and I will need to add support for Latin-9 Unicode. I have done some Googling but found nothing solid as to what is involved in the process. Any tips? Here is some code (Just the bits for Unicode stuff) use Unicode::String qw(utf8 latin1 utf16); # How to call $encoded_txt = $self->unicode_encode($item->{value}); # Function part sub unicode_encode { shift() if ref($_[0]); my $toencode = shift(); return undef unless defined($toencode); Unicode::String->stringify_as("utf8"); my $unicode_str = Unicode::String->new(); # encode Perl UTF-8 string into latin1 Unicode::String # - currently only Basic Latin and Latin 1 Supplement # are supported here due to issues with Unicode::String . $unicode_str->latin1( $toencode ); ... Any help would be great and thanks.
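
    Latin-9 is another name for ISO-8859-15, which differs from Latin-1 in a handful of positions, most visibly the euro sign. The question is about Perl, so the following Python sketch only illustrates the difference between the two encodings and is not a fix for the `Unicode::String` code above. The function names are made up for this example.

```python
def to_latin9(text):
    """Encode text as ISO-8859-15 (Latin-9); raises if a character is missing."""
    return text.encode("iso8859_15")


def to_latin1_or_none(text):
    """Encode text as Latin-1, or return None if it cannot be represented."""
    try:
        return text.encode("latin-1")
    except UnicodeEncodeError:
        return None
```

    The euro sign, for instance, has no Latin-1 encoding but maps to the single byte 0xA4 in Latin-9, while the pound sign is byte 0xA3 in both.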

    Read the article

  • Why many designs ignore normalization in RDBMS?

    - by Yosi
    I have seen many designs where normalization wasn't the first consideration in the decision-making phase. In many cases those designs included more than 30 columns, and the main approach was "to put everything in the same place". As far as I remember, normalization is one of the first and most important things, so why is it dropped so easily sometimes? Edit: Is it true that good architects and experts choose a denormalized design while inexperienced developers choose the opposite? What are the arguments against starting your design with normalization in mind?

    Read the article

  • Replacing latex with unicode symbols

    - by Elazar Leibovich
    Often, during a conversation or an email, or at a forum, I would like to type some math, but I don't need full equation support. Unicode symbols should suffice. What I need is an easy way to type math-related Unicode symbols. Since I already know LaTeX, it makes sense to use the LaTeX symbol mnemonics to type the math symbols. What I currently did is to write an AutoHotKey script which automatically replaces \latexSymbol with the corresponding Unicode symbol, using the "hotstrings" AutoHotKey feature. However, the AutoHotKey hotstrings proved unstable for many strings. Having a few dozen lines would cause AHK to fail to recognize the strings from time to time. Any other solution? (No, Alt+unicode number isn't convenient enough) Attached is my AHK script. The PutUni function is taken from here. ::\infty:: PutUni("e2889e") return ::\sum:: PutUni("e28891") return ::\int:: PutUni("e288ab") return ::\pm:: PutUni("c2b1") return ::\alpha:: PutUni("ceb1") return ::\beta:: PutUni("ceb2") return ::\phi:: PutUni("cf86") return ::\delta:: PutUni("ceb4") return ::\pi:: PutUni("cf80") return ::\omega:: PutUni("cf89") return ::\in:: PutUni("e28888") return ::\notin:: PutUni("e28889") return ::\iff:: PutUni("e28794") return ::\leq:: PutUni("e289a4") return ::\geq:: PutUni("e289a5") return ::\sqrt:: PutUni("e2889a") return ::\neq:: PutUni("e289a0") return ::\subset:: PutUni("e28a82") return ::\nsubset:: PutUni("e28a84") return ::\nsubseteq:: PutUni("e28a88") return ::\subseteq:: PutUni("e28a86") return ::\prod:: PutUni("e2888f") return ::\N:: PutUni("e28495") return
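
    As one alternative to per-symbol hotstrings, the replacement can be done in a single pass over finished text with a lookup table. The sketch below is in Python rather than AutoHotKey and covers only a small, hypothetical subset of the table; the names `LATEX_TO_UNICODE`, `replace_latex` and `utf8_hex` are made up for this example.

```python
import re

# Hypothetical subset of the mapping; extend as needed.
LATEX_TO_UNICODE = {
    "infty": "\u221e",
    "sum": "\u2211",
    "alpha": "\u03b1",
    "beta": "\u03b2",
    "phi": "\u03c6",
    "in": "\u2208",
    "notin": "\u2209",
    "leq": "\u2264",
    "geq": "\u2265",
    "N": "\u2115",
}

# A backslash followed by letters; the whole command name is matched at once,
# so \infty is never mistaken for \in followed by "fty".
_COMMAND = re.compile(r"\\([A-Za-z]+)")


def replace_latex(text):
    """Replace known \\command mnemonics with Unicode symbols."""
    def lookup(match):
        # Unknown commands are left exactly as typed.
        return LATEX_TO_UNICODE.get(match.group(1), match.group(0))
    return _COMMAND.sub(lookup, text)


def utf8_hex(symbol):
    """Return the UTF-8 bytes of symbol as a hex string."""
    return symbol.encode("utf-8").hex()
```

    `utf8_hex` produces hex strings in the same form as the arguments passed to `PutUni` above, which can help when checking or extending such a table: the Greek letter alpha (U+03B1), for instance, comes out as `ceb1`.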

    Read the article
