encodings - Page 5 - Developer IT

Handling over-long UTF-8 sequences

- by Grant McLean

I've just been reworking my Encoding::FixLatin Perl module to handle over-long utf8 byte sequences and convert them to the shortest normal form. My question is quite simply "is this a bad idea"? A number of sources (including this RFC) suggest that any over-long utf8 should be treated as an error and rejected. They caution against "naive implementations" and leave me with the impression that these things are inherently unsafe. Since the whole purpose of my module is to clean up messy data files with mixed encodings and convert them to nice clean utf8, this seems like just one more thing I can clean up so the application layer doesn't have to deal with it. My code does not concern itself with any semantic meaning the resulting characters might have, it simply converts them into a normalised form. Am I missing something. Is there a hidden danger I haven't considered?

Read the article

PostScript versus PDF as an output format

- by Brecht Machiels

I'm currently writing a typesetting application and I'm using PSG as the backend for producing postscript files. I'm now wondering whether that choice makes sense. It seems the ReportLab Toolkit offers all the features PSG offers, and more. ReportLab outputs PDF however. Advantages PDF offers: transparancy better support for character encodings (Unicode, for example) ability to embed TrueType and even OpenType fonts hyperlinks and bookmarks Is there any reason to use Postscript instead of directly outputting to PDF? While Postscript is a full programming language as opposed to PDF, as a basic output format for documents, that doesn't seem to offer any advantage. I assume a PDF can be readily converted to PostScript for printing? Some useful links: Wikipedia: PDF Adobe: PostScript vs. PDF

Read the article

Prefered method for looping sound flash as3

- by Brian Heylin

Hi there, I'm having some issues with looping a sound in flash AS3, in that when I tell the sound to loop I get a slight delay at the end/beginning of the audio. The audio is clipped correctly and will play without a gap on garage band. I know that there are issues with sound in general in flash, bugs with encodings and the inaccuracies with the SOUND_COMPLETE event (And Adobe should be embarrassed with their handling of these issues) I have tried to use the built in loop argument in the play method on the Sound class and also react on the SOUND_COMPLETE event, but both cause a delay. But has anyone come up with a technique for looping a sound without any noticeable gap?

Read the article

Set a script to automatically detect character encoding in a plain-text-file in Python?

- by Haidon

I've set up a script that basically does a large-scale find-and-replace on a plain text document. At the moment it works fine with ASCII, UTF-8, and UTF-16 (and possibly others, but I've only tested these three) encoded documents so long as the encoding is specified inside the script (the example code below specifies UTF-16). Is there a way to make the script automatically detect which of these character encodings is being used in the input file and automatically set the character encoding of the output file the same as the encoding used on the input file? findreplace = [ ('term1', 'term2'), ] inF = open(infile,'rb') s=unicode(inF.read(),'utf-16') inF.close() for couple in findreplace: outtext=s.replace(couple[0],couple[1]) s=outtext outF = open(outFile,'wb') outF.write(outtext.encode('utf-16')) outF.close() Thanks!

Read the article

Load JSON in Python as header character set

- by mridang

Hi everyone, I've always found character sets and encodings complicated to understand and here I'm faced with another problem. My apologies for any inaccuracies. I'll do my best. I'm requesting data from a server which returns JSON. In the HTTP headers it also returns the character set like so: Content-Type: text/html; charset=UTF-8 I'm using the JSON library in Python to load the JSON using the json.loads method. When I pass it the returned JSON, it gives me a dictionary in Unicode. I've Googled around and I know that JSON should return Unicode as JavaScript strings are Unicode objects. How can I load the JSON as UTF-8? I would like to use the same encoding as specified in the response header. I've read this post but it didn't help. Thank you.

Read the article

Load JSON in Python as header chracterset

- by mridang

Hi everyone, I've always found character-sets and encodings complicated to understand and here I'm faced with another problem. My apologies for any inaccuracies. I'll do my best. I'm requesting data from a server which returns JSON. In the HTTP headers it also returns the character.set like so: Content-Type: text/html; charset=UTF-8 I'm using the JSON library in python to load the JSON using the json.loads method. When I pass it the returned JSON, it gives me a dictionary in Unicode. I've Googled around and I know that JSON should return Unicode as JavaScript strings are Unicode objects. How can I load the JSON as UTF-8. I would like to use the same encoding as specified in the response header. I've read this post but it didn't help. Thank you.

Read the article

Should I convert overlong UTF-8 strings to their shortest normal form?

- by Grant McLean

I've just been reworking my Encoding::FixLatin Perl module to handle overlong UTF-8 byte sequences and convert them to the shortest normal form. My question is quite simply "is this a bad idea"? A number of sources (including this RFC) suggest that any over-long UTF-8 should be treated as an error and rejected. They caution against "naive implementations" and leave me with the impression that these things are inherently unsafe. Since the whole purpose of my module is to clean up messy data files with mixed encodings and convert them to nice clean utf8, this seems like just one more thing I can clean up so the application layer doesn't have to deal with it. My code does not concern itself with any semantic meaning the resulting characters might have, it simply converts them into a normalised form. Am I missing something. Is there a hidden danger I haven't considered?

Read the article

Can I turn off implicit Python unicode conversions to find my mixed-strings bugs?

- by Tal Weiss

When profiling our code I was surprised to find millions of calls to C:\Python26\lib\encodings\utf_8.py:15(decode) I started debugging and found that across our code base there are many small bugs, usually comparing a string to a unicode or adding a sting and a unicode. Python graciously decodes the strings and performs the following operations in unicode. How kind. But expensive! I am fluent in unicode, having read Joel Spolsky and Dive Into Python... I try to keep our code internals in unicode only. My question - can I turn off this pythonic nice-guy behavior? At least until I find all these bugs and fix them (usually by adding a u'u')? Some of them are extremely hard to find (a variable that is sometimes a string...). Python 2.6.5 (and I can't switch to 3.x).

Read the article

Using C# to detect whether a filename character is considered international

- by Morten Mertner

I've written a small console application (source below) to locate and optionally rename files containing international characters, as they are a source of constant pain with most source control systems (some background on this below). The code I'm using has a simple dictionary with characters to look for and replace (and nukes every other character that uses more than one byte of storage), but it feels very hackish. What's the right way to (a) find out whether a character is international? and (b) what the best ASCII substitution character would be? Let me provide some background information on why this is needed. It so happens that the danish Å character has two different encodings in UTF-8, both representing the same symbol. These are known as NFC and NFD encodings. Windows and Linux will create NFC encoding by default but respect whatever encoding it is given. Mac will convert all names (when saving to a HFS+ partition) to NFD and therefore returns a different byte stream for the name of a file created on Windows. This effectively breaks Subversion, Git and lots of other utilities that don't care to properly handle this scenario. I'm currently evaluating Mercurial, which turns out to be even worse at handling international characters.. being fairly tired of these problems, either source control or the international character would have to go, and so here we are. My current implementation: public class Checker { private Dictionary<char, string> internationals = new Dictionary<char, string>(); private List<char> keep = new List<char>(); private List<char> seen = new List<char>(); public Checker() { internationals.Add( 'æ', "ae" ); internationals.Add( 'ø', "oe" ); internationals.Add( 'å', "aa" ); internationals.Add( 'Æ', "Ae" ); internationals.Add( 'Ø', "Oe" ); internationals.Add( 'Å', "Aa" ); internationals.Add( 'ö', "o" ); internationals.Add( 'ü', "u" ); internationals.Add( 'ä', "a" ); internationals.Add( 'é', "e" ); internationals.Add( 'è', "e" ); internationals.Add( 'ê', "e" ); internationals.Add( '¦', "" ); internationals.Add( 'Ã', "" ); internationals.Add( '©', "" ); internationals.Add( ' ', "" ); internationals.Add( '§', "" ); internationals.Add( '¡', "" ); internationals.Add( '³', "" ); internationals.Add( '', "" ); internationals.Add( 'º', "" ); internationals.Add( '«', "-" ); internationals.Add( '»', "-" ); internationals.Add( '´', "'" ); internationals.Add( '`', "'" ); internationals.Add( '"', "'" ); internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 147 } )[ 0 ], "-" ); internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 148 } )[ 0 ], "-" ); internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 153 } )[ 0 ], "'" ); internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 166 } )[ 0 ], "." ); keep.Add( '-' ); keep.Add( '=' ); keep.Add( '\'' ); keep.Add( '.' ); } public bool IsInternationalCharacter( char c ) { var s = c.ToString(); byte[] bytes = Encoding.UTF8.GetBytes( s ); if( bytes.Length > 1 && ! internationals.ContainsKey( c ) && ! seen.Contains( c ) ) { Console.WriteLine( "X '{0}' ({1})", c, string.Join( ",", bytes ) ); seen.Add( c ); if( ! keep.Contains( c ) ) { internationals[ c ] = ""; } } return internationals.ContainsKey( c ); } public bool HasInternationalCharactersInName( string name, out string safeName ) { StringBuilder sb = new StringBuilder(); Array.ForEach( name.ToCharArray(), c => sb.Append( IsInternationalCharacter( c ) ? internationals[ c ] : c.ToString() ) ); int length = sb.Length; sb.Replace( " ", " " ); while( sb.Length != length ) { sb.Replace( " ", " " ); } safeName = sb.ToString().Trim(); string namePart = Path.GetFileNameWithoutExtension( safeName ); if( namePart.EndsWith( "." ) ) safeName = namePart.Substring( 0, namePart.Length - 1 ) + Path.GetExtension( safeName ); return name != safeName; } } And this would be invoked like this: FileInfo file = new File( "Århus.txt" ); string safeName; if( checker.HasInternationalCharactersInName( file.Name, out safeName ) ) { // rename file }

Read the article

how to show the right word in my code, my code is : os.urandom(64)

- by zjm1126

My code is: print os.urandom(64) which outputs: > "D:\Python25\pythonw.exe" "D:\zjm_code\a.py" \xd0\xc8=<\xdbD' \xdf\xf0\xb3>\xfc\xf2\x99\x93 =S\xb2\xcd'\xdbD\x8d\xd0\\xbc{&YkD[\xdd\x8b\xbd\x82\x9e\xad\xd5\x90\x90\xdcD9\xbf9.\xeb\x9b>\xef#n\x84 which isn't readable, so I tried this: print os.urandom(64).decode("utf-8") but then I get: > "D:\Python25\pythonw.exe" "D:\zjm_code\a.py" Traceback (most recent call last): File "D:\zjm_code\a.py", line 17, in <module> print os.urandom(64).decode("utf-8") File "D:\Python25\lib\encodings\utf_8.py", line 16, in decode return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-3: invalid data What should I do to get human-readable output?

Read the article

How do HTTP proxy caches decide between serving identity- vs. gzip-encoded resources?

- by mrclay

An HTTP server uses content-negotiation to serve a single URL identity- or gzip-encoded based on the client's Accept-Encoding header. Now say we have a proxy cache like squid between clients and the httpd. If the proxy has cached both encodings of a URL, how does it determine which to serve? The non-gzip instance (not originally served with Vary) can be served to any client, but the encoded instances (having Vary: Accept-Encoding) can only be sent to a clients with the identical Accept-Encoding header value as was used in the original request. E.g. Opera sends "deflate, gzip, x-gzip, identity, *;q=0" but IE8 sends "gzip, deflate". According to the spec, then, caches shouldn't share content-encoded caches between the two browsers. Is this true?

Read the article

Why don't scripting languages output Unicode to the Windows console?

- by hippietrail

The Windows console has been Unicode aware for at least a decade and perhaps as far back as Windows NT. However for some reason the major cross-platform scripting languages including Perl and Python only ever output various 8-bit encodings, requiring much trouble to work around. Perl gives a "wide character in print" warning, Pythong gives a charmap error and quits. Why on earth after all these years do they not just simply call the Win32 -W APIs that output UTF-16 Unicode instead of forcing everything through the ANSI/codepage bottleneck? Is it just that cross-platform performance is low priority? Is it that the languages use UTF-8 internally and find it too much bother to output UTF-16? Or are the -W APIs inherently broken to such a degree that they can't be used as-is?

Read the article

Trouble converting string/character to byte in lisp

- by WanderingPhd

I've some data that I'm reading in using read-line and I want to convert it into a byte-array. babel:string-to-octet works for the most part except when the character\byte is larger (above 200) in which case it returns two numbers. As an example, if the character is ú using babel:string-to-octet returns (195 185) instead of 250 which is what I'm looking for. I tried a number of encodings in babel but none of them seem to work. If I use read-byte or read-sequence it does read in 250. But for reasons of backward compatibility, I'm left with using read-line and I would like to know if there is something I'm missing when using babel:string-to-octet to convert ú to 250. I'm using ccl 1.8 btw.

Read the article

Should I convert overly-long UTF-8 strings to their shortest normal form?

- by Grant McLean

I've just been reworking my Encoding::FixLatin Perl module to handle overly-long UTF-8 byte sequences and convert them to the shortest normal form. My question is quite simply "is this a bad idea"? A number of sources (including this RFC) suggest that any over-long UTF-8 should be treated as an error and rejected. They caution against "naive implementations" and leave me with the impression that these things are inherently unsafe. Since the whole purpose of my module is to clean up messy data files with mixed encodings and convert them to nice clean utf8, this seems like just one more thing I can clean up so the application layer doesn't have to deal with it. My code does not concern itself with any semantic meaning the resulting characters might have, it simply converts them into a normalised form. Am I missing something. Is there a hidden danger I haven't considered?

Read the article

MODx character encoding

- by Piet

Ahhh character encodings. Don’t you just love them? Having character issues in MODx? Then probably the MODx manager character encoding, the character encoding of the site itself, your database’s character encoding, or the encoding MODx/php uses to talk to MySQL isn’t correct. The Website Encoding Your MODx site’s character encoding can be configured in the manager under Tools/Configuration/Site/Character encoding. This is the encoding your website’s visitors will get. The Manager’s Encoding The manager’s encoding can be changed by setting $modx_manager_charset at manager/includes/lang/<language>.inc.php like this (for example): $modx_manager_charset = 'iso-8859-1'; To find out what language you’re using (and thus was file you need to change), check Tools/Configuration/Site/Language (1 line above the character encoding setting). This needs to be the same encoding as your site. You can’t have your manager in utf8 and your site in iso-8859-1. Your Database’s Encoding The charset MODx/php uses to talk to your database can be set by changing $database_connection_charset in manager/includes/config.inc.php. This needs to be the same as your database’s charset. Make sure you use the correct corresponding charset, for iso-8859-1 you need to use ‘latin1′. Utf8 is just utf8. Example: $database_connection_charset = 'latin1'; Now, if you check Reports/System info, the ‘Database Charset’ might say something else. This is because the mysql variable ‘character_set_database’ is displayed here, which contains the character set used by the default database and not the one for the current database/connection. However, if you’d change this to display ‘character_set_connection’, it could still say something else because the ’set character set’ statement used by MODx doesn’t change this value either. The ’set names’ statement does, but since it turns out my MODx install works now as expected I’ll just leave it at this before I get a headache. If I saved you a potential headache or you think I’m totally wrong or overlooked something, let me know in the comments. btw: I want to be able to use a real editor with MODx. Somehow.

Read the article

Should UTF-16 be considered harmful?

- by Artyom

I'm going to ask what is probably quite a controversial question: "Should one of the most popular encodings, UTF-16, be considered harmful?" Why do I ask this question? How many programmers are aware of the fact that UTF-16 is actually a variable length encoding? By this I mean that there are code points that, represented as surrogate pairs, take more than one element. I know; lots of applications, frameworks and APIs use UTF-16, such as Java's String, C#'s String, Win32 APIs, Qt GUI libraries, the ICU Unicode library, etc. However, with all of that, there are lots of basic bugs in the processing of characters out of BMP (characters that should be encoded using two UTF-16 elements). For example, try to edit one of these characters: 𝄞 (U+1D11E) MUSICAL SYMBOL G CLEF 𝕥 (U+1D565) MATHEMATICAL DOUBLE-STRUCK SMALL T 𝟶 (U+1D7F6) MATHEMATICAL MONOSPACE DIGIT ZERO 𠂊 (U+2008A) Han Character You may miss some, depending on what fonts you have installed. These characters are all outside of the BMP (Basic Multilingual Plane). If you cannot see these characters, you can also try looking at them in the Unicode Character reference. For example, try to create file names in Windows that include these characters; try to delete these characters with a "backspace" to see how they behave in different applications that use UTF-16. I did some tests and the results are quite bad: Opera has problem with editing them (delete required 2 presses on backspace) Notepad can't deal with them correctly (delete required 2 presses on backspace) File names editing in Window dialogs in broken (delete required 2 presses on backspace) All QT3 applications can't deal with them - show two empty squares instead of one symbol. Python encodes such characters incorrectly when used directly u'X'!=unicode('X','utf-16') on some platforms when X in character outside of BMP. Python 2.5 unicodedata fails to get properties on such characters when python compiled with UTF-16 Unicode strings. StackOverflow seems to remove these characters from the text if edited directly in as Unicode characters (these characters are shown using HTML Unicode escapes). WinForms TextBox may generate invalid string when limited with MaxLength. It seems that such bugs are extremely easy to find in many applications that use UTF-16. So... Do you think that UTF-16 should be considered harmful?

Read the article

iconv supports too few encoding

- by schemacs

iconv -l outputs too few encodings on CentOS 6.5: $ iconv -l 10646-1:1993, 10646-1:1993/UCS4, ANSI_X3.4-1968, ANSI_X3.4-1986, ANSI_X3.4, ASCII, CP367, CSASCII, CSUCS4, IBM367, ISO-10646, ISO-10646/UCS2, ISO-10646/UCS4, ISO-10646/UTF-8, ISO-10646/UTF8, ISO-IR-6, ISO-IR-193, ISO646-US, ISO_646.IRV:1991, OSF00010020, OSF00010100, OSF00010101, OSF00010102, OSF00010104, OSF00010105, OSF00010106, OSF05010001, UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UCS2, UCS4, UNICODEBIG, UNICODELITTLE, US-ASCII, US, UTF-8, UTF8, WCHAR_T But on my Ubuntu the list seems much longer, here is different: CentOS6.5: $ php -a php > echo iconv('utf8', 'gbk', 'abc'); PHP Notice: iconv(): Wrong charset, conversion from `utf8' to `gbk' is not allowed in php shell code on line 1 php > quit $ php -i|grep iconv iconv iconv support => enabled iconv implementation => glibc iconv library version => 2.12 iconv.input_encoding => ISO-8859-1 => ISO-8859-1 iconv.internal_encoding => ISO-8859-1 => ISO-8859-1 iconv.output_encoding => ISO-8859-1 => ISO-8859-1 Ubuntu 14.04: $ php -a Interactive mode enabled php > echo iconv('utf8', 'gbk', "abc\n"); abc php > quit $ php -i|grep iconv iconv iconv support => enabled iconv implementation => glibc iconv library version => 2.19 iconv.input_encoding => ISO-8859-1 => ISO-8859-1 iconv.internal_encoding => ISO-8859-1 => ISO-8859-1 iconv.output_encoding => ISO-8859-1 => ISO-8859-1 But I don't want to recompile glibc(this will be huge mount of work), any idea on how to ad new encoding support?

Read the article

PhpMyAdmin import/export - strange character encoding issues.

- by John Hunt

Hello, I'm migrating a site to a new host, and there are a couple of databases on there. There's no SSH access so I'm stuck with phpmyadmin. The issue is that certain characters (namely just whitespace) seems to being corrupt on the new site (same html, and apache doesn't seem to be messing with any encodings - you can see the strange characters have changed when I use less on my linux machine after downloading a table dump from both servers.) The issue isn't as bad if I import into the new database as utf-8 - whitespace characters only have one funny A type symbol instead of two. I've been trying various combinations of character encoding etc to no avail. Exporting from: phpMyAdmin 2.6.2 MySQL 4.1.20 MySQL connection collation: utf8_general_ci MySQL charset: UTF-8 Unicode (utf8) Collation on tables and their fields is: latin1_swedish_ci Importing to: phpMyAdmin - 2.11.9.2 MySQL client version: 5.0.45 MySQL charset: UTF-8 Unicode (utf8) MySQL connection collation: utf8_general_ci The import sql has this kind of thing in it: ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=192 ; I get the impression this is actually a bug or something with mysqldump as nothing seems to work.. does anyone have any insight into this? Cheers, John.

Read the article

How to detect UTF-8-based encoded strings [closed]

- by Diego Sendra

A customer of asked us to build him a multi-language based support VB6 scraper, for which we had the need to detect UTF-8 based encoded strings to decode it later for proper displaying in application UI. It's necessary to point out that this need arises based on VB6 limitations to natively support UTF-8 in its controls, contrary to what it happens in .NET where you can tell a control that it should expect UTF-8 encoding. VB6 natively supports ISO 8859-1 and/or Windows-1252 encodings only, for which textboxes, dropdowns, listview controls, others can't be defined to natively support/expect UTF-8 as you can do in .NET considering what we just explained; so we would see weird symbols such as Ã©, Ã¨ among others, making it a whole mess at the time of displaying. So, next function contains whole UTF-8 encoded punctuation marks and symbols from languages like Spanish, Italian, German, Portuguese, French and others, based on an excellent UTF-8 based list we got from this link - Ref. http://home.telfort.nl/~t876506/utf8tbl.html Basically, the function compares if each and one of the listed UTF-8 encoded sentences, separated by | (pipe) are found in our passed string making a substring search first. Whether it's not found, it makes an alternative ASCII value based search to get a match. Say, a string like "Societé" (Society in english) would return FALSE through calling isUTF8("Societé") while it would return TRUE when calling isUTF8("SocietÃˆ") since Ãˆ is the UTF-8 encoded representation of é. Once you got it TRUE or FALSE, you can decode the string through DecodeUTF8() function for properly displaying it, a function we found somewhere else time ago and also included in this post. Function isUTF8(ByVal ptstr As String) Dim tUTFencoded As String Dim tUTFencodedaux Dim tUTFencodedASCII As String Dim ptstrASCII As String Dim iaux, iaux2 As Integer Dim ffound As Boolean ffound = False ptstrASCII = "" For iaux = 1 To Len(ptstr) ptstrASCII = ptstrASCII & Asc(Mid(ptstr, iaux, 1)) & "|" Next tUTFencoded = "Ã„|Ã…|Ã‡|Ã‰|Ã‘|Ã–|ÃŒ|Ã¡|Ã|Ã¢|Ã¤|Ã£|Ã¥|Ã§|Ã©|Ã¨|Ãª|Ã«|Ã|Ã¬|Ã®|Ã¯|Ã±|Ã³|Ã²|Ã´|Ã¶|Ãµ|Ãº|Ã¹|Ã»|Ã¼|â€|Â°|Â¢|Â£|Â§|â€¢|Â¶|ÃŸ|Â®|Â©|â„¢|Â´|Â¨|â‰|Ã†|Ã˜|âˆž|Â±|â‰¤|â‰¥|Â¥|Âµ|âˆ‚|âˆ‘|âˆ|Ï€|âˆ«|Âª|Âº|Î©|Ã¦|Ã¸|Â¿|Â¡|Â¬|âˆš|Æ’|â‰ˆ|âˆ†|Â«|Â»|â€¦|Â|Ã€|Ãƒ|Ã•|Å’|Å“|â€“|â€”|â€œ|â€|â€˜|â€™|Ã·|â—Š|Ã¿|Å¸|â„|â‚¬|â€¹|â€º|ï¬|ï¬‚|â€¡|Â·|â€š|â€ž|â€°|Ã‚|Ãš|Ã|Ã‹|Ãˆ|Ã|ÃŽ|Ã|ÃŒ|Ã“|Ã”|ï£¿|Ã’|Ãš|Ã›|Ã™|Ä±|Ë†|Ëœ|Â¯|Ë˜|Ë™|Ëš|Â¸|Ë|Ë›|Ë‡" & _ "Å|Å¡|Â¦|Â²|Â³|Â¹|Â¼|Â½|Â¾|Ã|Ã—|Ã|Ãž|Ã°|Ã½|Ã¾" & _ "â‰|âˆž|â‰¤|â‰¥|âˆ‚|âˆ‘|âˆ|Ï€|âˆ«|Î©|âˆš|â‰ˆ|âˆ†|â—Š|â„|ï¬|ï¬‚|ï£¿|Ä±|Ë˜|Ë™|Ëš|Ë|Ë›|Ë‡" tUTFencodedaux = Split(tUTFencoded, "|") If UBound(tUTFencodedaux) > 0 Then iaux = 0 Do While Not ffound And Not iaux > UBound(tUTFencodedaux) If InStr(1, ptstr, tUTFencodedaux(iaux), vbTextCompare) > 0 Then ffound = True End If If Not ffound Then 'ASCII numeric search tUTFencodedASCII = "" For iaux2 = 1 To Len(tUTFencodedaux(iaux)) 'gets ASCII numeric sequence tUTFencodedASCII = tUTFencodedASCII & Asc(Mid(tUTFencodedaux(iaux), iaux2, 1)) & "|" Next 'tUTFencodedASCII = Left(tUTFencodedASCII, Len(tUTFencodedASCII) - 1) 'compares numeric sequences If InStr(1, ptstrASCII, tUTFencodedASCII) > 0 Then ffound = True End If End If iaux = iaux + 1 Loop End If isUTF8 = ffound End Function Function DecodeUTF8(s) Dim i Dim c Dim n s = s & " " i = 1 Do While i <= Len(s) c = Asc(Mid(s, i, 1)) If c And &H80 Then n = 1 Do While i + n < Len(s) If (Asc(Mid(s, i + n, 1)) And &HC0) <> &H80 Then Exit Do End If n = n + 1 Loop If n = 2 And ((c And &HE0) = &HC0) Then c = Asc(Mid(s, i + 1, 1)) + &H40 * (c And &H1) Else c = 191 End If s = Left(s, i - 1) + Chr(c) + Mid(s, i + n) End If i = i + 1 Loop DecodeUTF8 = s End Function

Read the article

configuring local W3C validator on xampp - windows xp sp3

- by Gabriel

Hi I have no experience on perl. I am trying to configure the W3C validator on my localhost (win32). I have already followed all the instructions given by W3C @http://validator.w3.org/docs/install_win.html, but I am getting the following error: Can't locate loadable object for module Encode::HanExtra in @INC (@INC contains: C:/xampp/perl/lib C:/xampp/perl/site/lib .) at C:/xampp/validator-0.8.6/httpd/cgi-bin/check line 49 Compilation failed in require at C:/xampp/validator-0.8.6/httpd/cgi-bin/check line 49. BEGIN failed--compilation aborted at C:/xampp/validator-0.8.6/httpd/cgi-bin/check line 49. I'm running perl 5.10.1 on xampp and the following HandExtra Module http://cpansearch.perl.org/src/AUDREYT/Encode-HanExtra-0.23/lib/Encode/HanExtra.pm This is C:/xampp/validator-0.8.6/httpd/cgi-bin/check line 49 : use Encode::HanExtra qw(); # for some chinese character encodings if I document that line using # I get a similar message related to other object: Can't locate loadable object for module Sub::Name in @INC (@INC contains: C:/xampp/perl/lib C:/xampp/perl/site/lib .) at C:/xampp/perl/site/lib/Moose.pm line 12 This is Line 12 at Moose.pm : use Sub::Name 'subname'; I don't know how to proceed. I will appreciate any advise. Thanks

Read the article

How do I remove accents from characters in a PHP string?

- by georgebrock

I'm attempting to remove accents from characters in PHP string as the first step to making the string usable in a URL. I'm using the following code: $input = "Fóø Bår"; setlocale(LC_ALL, "en_US.utf8"); $output = iconv("utf-8", "ascii//TRANSLIT", $input); print($output); The output I would expect would be something like this: F'oo Bar However, instead of the accented characters being transliterated they are replaced with question marks: F?? B?r Everything I can find online indicates that setting the locale will fix this problem, however I'm already doing this. I've already checked the following details: The locale I am setting is supported by the server (included in the list produced by locale -a) The source and target encodings (UTF-8 and ASCII) are supported by the server's version of iconv (included in the list produced by iconv -l) The input string is UTF-8 encoded (verified using PHP's mb_check_encoding function, as suggested in the answer by mercator) The call to setlocale is successful (it returns 'en_US.utf8' rather than FALSE) The cause of the problem: The server is using the wrong implementation of iconv. It has the glibc version instead of the required libiconv version. Note that the iconv function on some systems may not work as you expect. In such case, it'd be a good idea to install the GNU libiconv library. It will most likely end up with more consistent results. – PHP manual's introduction to iconv Details about the iconv implementation that is used by PHP are included in the output of the phpinfo function. (I'm not able to re-compile PHP with the correct iconv library on the server I'm working with for this project so the answer I've accepted below is the one that was most useful for removing accents without iconv support.)

Read the article

doublechecking: no db-wide 'unicode switch' for sql server in the foreseeable future, i.e. like Orac

- by user72150

Hi all, I believe I know the answer to this question, but wanted to confirm: Question Does Sql server (or will it in the foreseeable future), offer a database-wide "unicode switch" which says "store all characters in unicode (UTF-16, UCS-2, etc)", i.e. like Oracle. The Context Our application has provided "CJK" (Chinese-Japanese-Korean) support for years--using Oracle as the db store. Recently folks have been asking for the same support in sql server. We store our db schema definition in xml and generate the vendor-specific definitions (oracle, sql server) using vendor-specific xsl. We can make the change easily. The problem is for upgrades. Generated scripts would need to change the column types for 100+ columns from varchar to nvarchar, varchar(max) to nvarchar(max), etc. These changes require dropping and recreating indexes and foreign keys if the any indexes/fk's exist on the column. Non-trivial. Risky. DB-wide character encodings for us would eliminate programming changes. (I.e. we would not to change the column types from varchar to nvarchar; sql server would correctly store unicode data in varchar columns). I had thought that eventually sql server would "see the light" and allow storing unicode in varchar/clob columns. Evidently not yet. Recap So just to triple check: does mssql offer a database-wide switch for character encoding? Will it in SQL2008R3? or 2010? thanks, bill

Read the article

Should UTF-16 be considered harmful?

- by Artyom

I'm going to ask what is probably quite a controversial question: "Should one of the most popular encodings, UTF-16, be considered harmful?" Why do I ask this question? How many programmers are aware of the fact that UTF-16 is actually a variable length encoding? By this I mean that there are code points that, represented as surrogate pairs, take more then one element. I know; lots of applications, frameworks and APIs use UTF-16, such as Java's String, C#'s String, Win32 APIs, Qt GUI libraries, the ICU Unicode library, etc. However, with all of that, there are lots of basic bugs in the processing of characters out of BMP (characters that should be encoded using two UTF-16 elements). For example, try to edit one of these characters: 𝄞 𝕥 𝟶 𠂊 You may miss some, depending on what fonts you have installed. These characters are all outside of the BMP (Basic Multilingual Plane). If you cannot see these characters, you can also try looking at them in the Unicode Character reference. For example, try to create file names in Windows that include these characters; try to delete these characters with a "backspace" to see how they behave in different applications that use UTF-16. I did some tests and the results are quite bad: Opera has problem with editing them Notepad can't deal with them correctly (delete for example) File names editing in Window dialogs in broken All QT3 applications can't deal with them. StackOverflow seems to remove these characters if edited directly in as Unicode characters, and only seems to allow them as HTML Unicode escapes. So... This was very simple test. Do you think that UTF-16 should be considered harmful?

Read the article

WCF Message Debugging on WebHttpBehavior

- by Programming Hero

I've created a custom binding in WCF for a custom MessageEncoder to allow messages to be written as XML using a wider range of encodings than WCF supports out of the box. The encoder appears to be working and I am able to send and receive messages, but I want to verify that the XML message being written is exactly as required by the service I am trying to consume. I've turned on message logging for WCF using the diagnostic trace listeners to output the messages sent and received over the wire to a log file. Unfortunately, for calls using my encoder, the message is displayed as ... stream ... EDIT: I don't think it's anything to do with my custom encoding. I have experimented with my custom binding a little, switching to using the built-in text encoding and http transport. I still don't get a message body logged in the message trace. EDIT2: Having done further investigation, the issue looks to be related not to the custom binding, but the custom behaviour. I'm apply the <webHttp/> behaviour. Once this is specified (along with manual addressing), the tracing behaviour shows up. Is this a known issue with WebHttpBehavior? Am I still barking up the wrong tree?

Read the article

Getting a binary file from resource in C#

- by Jesse Knott

Hello, I am having a bit of a problem, I am trying to get a PDF as a resource in my application. At this point I have a fillable PDF that I have been able to store as a file next to the binary, but now I am trying to embed the PDF as a resource in the binary. byte[] buffer; try { s = typeof(BattleTracker).Assembly.GetManifestResourceStream("libReports.Resources.DAForm1594.pdf"); buffer = new byte[s.Length]; int read = 0; do { read = s.Read(buffer, read, 32768); } while (read > 0); } catch (Exception e) { throw new Exception("Error: could not import report:", e); } // read existing PDF document PdfReader r = new PdfReader( // optimize memory usage buffer, null ); Every time I run the code I get an error saying "Rebuild trailer not found. Original Error: PDF startxref not found" When I was just opening the file via a path to the static file in my directory it worked fine. I have tried using different encodings UTF-8, UTF-32, UTF-7, ASCII, etc etc.... As a side note I had the same problem with getting a Powerpoint file as a resource, I was finally able to fix that problem by converting the Powerpoint to xml and using that. I have considered doing the same for the PDF but I am referencing elements by field name which does not seem to work with XML PDFs. Can anyone help me out with this? Thanks in advance!

Search Results

Search found 168 results on 7 pages for 'encodings'.

Page 5/7 | < Previous Page | 1 2 3 4 5 6 7 | Next Page >

- by Grant McLean

- by Brecht Machiels

- by Brian Heylin

- by Haidon

- by mridang

- by mridang

- by Grant McLean

- by Tal Weiss

- by Morten Mertner

- by zjm1126

- by mrclay

- by hippietrail

- by WanderingPhd

- by Grant McLean

- by Piet

- by Artyom

- by schemacs

- by John Hunt

- by Diego Sendra

- by Gabriel

- by georgebrock

- by user72150

- by Artyom

- by Programming Hero

- by Jesse Knott

< Previous Page | 1 2 3 4 5 6 7 | Next Page >