unicode normalization - Page 29

\w in PHP preg_replace covers only second byte of UTF-8 chars

- by Andrey

We have this code: $value = preg_replace("/[^\w]/", '', $value); where $value is in UTF-8. After this transformation the first byte of each multibyte character is stripped. How can I make \w cover UTF-8 characters completely? Sorry, I am not very good at PHP.
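
The symptom is typical of a regex engine that matches byte by byte instead of character by character. In PHP the usual remedy is the `u` pattern modifier (for example `/[^\w]/u`), which makes PCRE treat pattern and subject as UTF-8. The sketch below is in Python rather than PHP and only illustrates the byte-versus-character distinction; the function names and the sample string are made up for the example.

```python
import re

def strip_non_word_bytes(data: bytes) -> bytes:
    # Bytes pattern: \w matches only ASCII [A-Za-z0-9_], so every byte of a
    # multibyte UTF-8 sequence counts as "not a word character" and is removed.
    return re.sub(rb"[^\w]", b"", data)

def strip_non_word_text(text: str) -> str:
    # Text pattern: \w is Unicode-aware by default in Python 3, so letters
    # such as õ and ü survive as whole characters.
    return re.sub(r"[^\w]", "", text)

sample = "õun, ülikool!"  # hypothetical input
bytewise = strip_non_word_bytes(sample.encode("utf-8"))
charwise = strip_non_word_text(sample)
```

Matching on decoded text (or, in PHP, with `/u`) keeps each character intact; matching on raw bytes damages it.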


Characters with jquery json

- by Mikk

Hi everyone, I'm using jQuery $.getJSON to retrieve a list of cities. Everything works fine, but I'm from Estonia (probably most of you don't know much about this country =D) and we are using some characters like õ, ü, ä, ö. When I pass letters like this to the callback function, I keep getting empty strings. I've tried to base64 encode (server-side) and decode (jQuery base64 plugin) the strings (I thought it was a good idea as long as I can compress pages with PHP, so I don't have to worry about bandwidth), but this way I end up with some random Chinese symbols. What would be the best workaround for this problem? Thank you.


Python: Removing particular character (u"\u2610") from string

- by duhaime

I have been wrestling with decoding and encoding in Python, and I can't quite figure out how to resolve my problem. I am looping over xml text files (sample) that are apparently coded in utf-8, using Beautiful Soup to parse each file, then looking to see if any sentence in the file contains one or more words from two different lists of words. Because the xml files are from the eighteenth century, I need to retain the em dashes that are in the xml. The code below does this just fine, but it also retains a pesky box character that I wish to remove. I believe the box character is this character.

(You can find an example of the character I wish to remove in line 3682 of the sample file above. On this webpage, the character looks like an 'or' pipe, but when I read the xml file in Komodo, it looks like a box. When I try to copy and paste the box into a search engine, it looks like an 'or' pipe. When I print to console, though, the character looks like an empty box.)

To sum up, the code below runs without errors, but it prints the empty box character that I would like to remove.

    for work in glob.glob(pathtofiles):
        openfile = open(work)
        readfile = openfile.read()
        stringfile = str(readfile)
        decodefile = stringfile.decode('utf-8', 'strict') #is this the dodgy line?
        soup = BeautifulSoup(decodefile)
        textwithtags = soup.findAll('text')
        textwithtagsasstring = str(textwithtags)
        #this method strips everything between anglebrackets as it should
        textwithouttags = stripTags(textwithtagsasstring)
        #clean text
        nonewlines = textwithouttags.replace("\n", " ")
        noextrawhitespace = re.sub(' +',' ', nonewlines)
        print noextrawhitespace #the boxes appear

I tried to remove the boxes by using

    noboxes = noextrawhitespace.replace(u"\u2610", "")

But Python threw an error flag:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 280: ordinal not in range(128)

Does anyone know how I can remove the boxes from the xml files? I would be grateful for any help others can offer.
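
One likely reading of the error: in Python 2, str(...) turns the parsed text back into a byte string, and calling .replace() on a byte string with a unicode argument triggers an implicit ASCII decode, which fails on the first non-ASCII byte (0xe2 is the lead byte of many UTF-8 punctuation characters). A minimal sketch in Python 3, assuming the unwanted character really is U+2610 as the question guesses; the sample sentence is invented:

```python
BALLOT_BOX = "\u2610"  # U+2610 BALLOT BOX, the "empty box" guessed in the question

def remove_boxes(raw: bytes) -> str:
    # Decode once, at the boundary, then work on text only.
    text = raw.decode("utf-8")
    return text.replace(BALLOT_BOX, "")

# Hypothetical sample containing an em dash (kept) and a box (removed).
raw = "liberty \u2014 and \u2610 justice".encode("utf-8")
cleaned = remove_boxes(raw)
```

The em dash survives because only the one code point is replaced; the key point is that the replacement happens on decoded text, never on bytes.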


Python's string.translate() doesn't fully work?

- by Rhubarb

Given this example, I get the error that follows:

    print u'\2033'.translate({2033:u'd'})

    C:\Python26\lib\encodings\cp437.pyc in encode(self, input, errors)
         10
         11     def encode(self,input,errors='strict'):
    ---> 12         return codecs.charmap_encode(input,errors,encoding_map)
         13
         14     def decode(self,input,errors='strict'):

    UnicodeEncodeError: 'charmap' codec can't encode character u'\x83' in position 0
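
The error message mentions u'\x83', which hints at what happened: '\2033' is parsed as the octal escape \203 (U+0083) followed by the digit 3, and the table key 2033 is a decimal number, not the code point U+2033. A short Python 3 sketch of the difference (variable names are illustrative):

```python
broken = "\2033"     # octal escape \203 (U+0083) followed by the digit "3"
intended = "\u2033"  # U+2033 DOUBLE PRIME, written with a \u escape

# translate() tables are keyed by code point, so the key needs to be 0x2033.
table = {0x2033: "d"}
result = intended.translate(table)

# With the decimal key 2033 (which is U+07F1) nothing matches.
unchanged = intended.translate({2033: "d"})
```

With the \u escape and a hexadecimal key, translate() behaves as expected.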


wchar to char in c++

- by Chris

I have a Windows CE console application whose entry point looks like this:

    int _tmain(int argc, _TCHAR* argv[])

I want to check the contents of argv[1] for "-s" and convert argv[2] into an integer. I am having trouble narrowing the arguments or accessing them to test. I initially tried the following with little success:

    if (argv[1] == L"-s")

I also tried using the narrow function of wostringstream on each character, but this crashed the application. Can anyone shed some light? Thanks


Obtain File size with os.path.getsize() in Python 2.7.5

- by Ruxuan Ouyang

I am new to Python. I am trying to use os.path.getsize() to obtain the file size. However, if the file name is not in English, but in Chinese, German, French, etc., Python cannot recognize it and does not return the size of the file. Could you please help me with it? How can I let Python recognize the file's name and return the size of these kinds of files?

For example, the file's name is: ?????????? ????????????? ? ????????????? ???????? ?? 2030?.doc

    path="C:\xxxx\xxx\xxxx\?????????? ????????????? ? ????????????? ???????? ?? 2030?.doc"

I'd like to use os.path.getsize(path), but it does not recognize the file name. Could you please kindly tell me what I should do? Thank you very much!
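
In Python 2.7 on Windows, a common cause is passing the path as a byte string; passing a unicode string (for example u"..." together with a source-encoding declaration) lets Python use the wide-character file APIs. The sketch below uses Python 3, where paths are text by default. The file name is hypothetical, since the original one is not recoverable from the question.

```python
import os
import tempfile

def size_of(path: str) -> int:
    # os.path.getsize accepts a text path; non-ASCII names need no special handling.
    return os.path.getsize(path)

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical non-English file name, created only for this demonstration.
    path = os.path.join(folder, "отчёт_报告_café.doc")
    with open(path, "wb") as handle:
        handle.write(b"12345")
    measured = size_of(path)
```

The design point is to keep file names as text from end to end rather than as bytes in some console code page.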


If a command line program is unsure of stdout's encoding, what encoding should it output?

- by mackstann

I have a command line program written in Python, and when I pipe it through another program on the command line, sys.stdout.encoding is None. This makes sense, I suppose -- the output could be another program, or a file you're redirecting it into, or whatever, and it doesn't know what encoding is desired. But neither do I! This program will be used by many different people (humor me) in different ways. Should I play it safe and output only ascii (replacing non-ascii chars with question marks)? Or should I output UTF-8, since it's so widespread these days?
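
One possible policy, sketched here under stated assumptions: use the stream's declared encoding when there is one, fall back to UTF-8 otherwise, and replace unencodable characters instead of raising. The helper names are illustrative, and whether UTF-8 is the right fallback is exactly the judgment call the question asks about.

```python
import io
import sys

def pick_encoding(stream, fallback="utf-8"):
    # A piped or redirected stream may report no encoding at all.
    return getattr(stream, "encoding", None) or fallback

def text_writer(binary_stream, encoding):
    # errors="replace" substitutes "?" for characters the target cannot encode.
    return io.TextIOWrapper(binary_stream, encoding=encoding, errors="replace")

# Demonstration against an in-memory sink instead of the real stdout.
demo_sink = io.BytesIO()
demo_writer = text_writer(demo_sink, "ascii")
demo_writer.write("caf\u00e9")
demo_writer.flush()

# Typical use would be: text_writer(sys.stdout.buffer, pick_encoding(sys.stdout))
```

With an ASCII target the é degrades to a question mark; with the UTF-8 fallback it would be written unchanged.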


Four byte encoding of U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS)?

- by knorv

Which character encoding represents the character ö (U+00F6, LATIN SMALL LETTER O WITH DIAERESIS or simply put chr(246) in ISO-8859-1) as the four octets combination chr(195) . chr(63) . chr(194) . chr(164)?


Why does Perl lose foreign characters on Windows input - can this be fixed (if so, how) or is Perl an outdated dinosaur that just can't handle this?

- by Alex R

Note below how ã changes to a. This is causing me a huge problem as foreign characters show up in URLs, e.g. http://pt.wikipedia.org/wiki/Cão

The OS is Windows 7, 64-bit. The Perl is:

    This is perl 5, version 12, subversion 2 (v5.12.2) built for MSWin32-x64-multi-thread
    (with 8 registered patches, see perl -V for more detail)
    Copyright 1987-2010, Larry Wall
    Binary build 1202 [293621] provided by ActiveState http://www.ActiveState.com
    Built Sep 6 2010 22:53:42

Additional update: To get around my particular problem, I tried using File::Find instead of piped input. The issue actually gets worse:


Why does Perl lose foreign characters on Windows; can this be fixed (if so, how)?

- by Alex R

Note below how ã changes to a.

NOTE2: Before you blame this on CMD.EXE and Windows pipe weirdness, see Experiment 2 below which gets a similar problem using File::Find.

The particular problem I'm trying to fix involves working with image files stored on a local drive, and manipulating the file names which may contain foreign characters. The two experiments shown below are intermediate debugging steps. The ã character is common in Latin languages, e.g. http://pt.wikipedia.org/wiki/Cão

Experiment 1

Experiment 2

To get around my particular problem, I tried using File::Find instead of piped input. The issue actually gets worse:

Debugging update: I tried some of the tricks listed at http://perldoc.perl.org/perlunicode.html, e.g. use utf8, use feature 'unicode_strings', etc., to no avail.

Environment and Version Info

The OS is Windows 7, 64-bit. The Perl is:

    This is perl 5, version 12, subversion 2 (v5.12.2) built for MSWin32-x64-multi-thread
    (with 8 registered patches, see perl -V for more detail)
    Copyright 1987-2010, Larry Wall
    Binary build 1202 [293621] provided by ActiveState http://www.ActiveState.com
    Built Sep 6 2010 22:53:42


Working with Japanese filenames in PHP 5.3 and Windows Vista?

- by Jon

I'm currently trying to write a simple script that looks in a folder, and returns a list of all the file names in an RSS feed. However I've hit a major wall... Whenever I try to read filenames with Japanese characters in them, it shows them as ?'s. I've tried the solutions mentioned here: http://stackoverflow.com/questions/482342/php-readdir-problem-with-japanese-language-file-name - however they do not work for some reason, even with:

    header('Content-Type: text/html; charset=UTF-8');
    setlocale(LC_ALL, 'en_US.UTF8');
    mb_internal_encoding("UTF-8");

at the top (exporting as plain text until I can sort this out). What can I do? I need this to work and I don't have much time.


How can i convert from wostream to ostream

- by Avishay

I am using a function that receives an ostream, but I have a wostream. Is there a way to convert one to the other? In particular, I want to use boost::write_graphviz, which takes an ostream, but I am currently inside an operator<< for wostream.


Forcing a mixed ISO-8859-1 and UTF-8 multi-line string into UTF-8 in Perl

- by knorv

Consider the following problem: A multi-line string $junk contains some lines which are encoded in UTF-8 and some in ISO-8859-1. I don't know a priori which lines are in which encoding, so heuristics will be needed. I want to turn $junk into pure UTF-8 with proper re-encoding of the ISO-8859-1 lines. Also, in the event of errors in the processing I want to provide a "best effort result" rather than throwing an error.

My current attempt looks like this:

    $junk = &force_utf8($junk);

    sub force_utf8 {
      my $input = shift;
      my $output = '';
      foreach my $line (split(/\n/, $input)) {
        if (utf8::valid($line)) {
          utf8::decode($line);
        }
        $output .= "$line\n";
      }
      return $output;
    }

While this appears to work I'm certain this is not the optimal solution. How would you improve the force_utf8(...) sub?
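
A common heuristic for this situation is "try strict UTF-8 first, fall back to ISO-8859-1", because text that is really ISO-8859-1 with non-ASCII characters is rarely valid UTF-8 by accident. The sketch below shows that idea in Python rather than Perl; it is one possible approach, not a reconstruction of any accepted answer, and it reuses the name force_utf8 only for readability.

```python
def force_utf8(data: bytes) -> bytes:
    """Re-encode each line to UTF-8, guessing its source encoding per line."""
    fixed = []
    for line in data.split(b"\n"):
        try:
            # Strict decode: succeeds only for well-formed UTF-8 (including ASCII).
            text = line.decode("utf-8")
        except UnicodeDecodeError:
            # ISO-8859-1 maps every byte to a character, so this never fails,
            # which gives the "best effort" behaviour asked for.
            text = line.decode("iso-8859-1")
        fixed.append(text.encode("utf-8"))
    return b"\n".join(fixed)

# Hypothetical mixed input: the same word in both encodings.
mixed = "größe".encode("utf-8") + b"\n" + "größe".encode("iso-8859-1")
repaired = force_utf8(mixed)
```

The limitation is the usual one for heuristics: an ISO-8859-1 line that happens to form valid UTF-8 byte sequences will be misread, so the result is best effort only.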


Differentiate between TCHAR and _TCHAR

- by Vulcan Eager

What are the various differences between the two symbols TCHAR and _TCHAR type defined in the Windows header tchar.h? Explain with examples. Briefly describe scenarios where you would use TCHAR as opposed to _TCHAR in your code. (10 marks)


How can I test if an input field contains foreign characters?

- by zeckdude

I have an input field in a form. Upon pushing submit, I want to validate to make sure the user entered non-Latin characters only, so any foreign language characters, like Chinese among many others. Or at the very least test to make sure it does not contain any Latin characters. Could I use a regular expression for this? What would be the best approach for this? I am validating in both JavaScript and PHP. What solutions can I use to check for foreign characters in the input field in both programming languages?
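
In PHP and modern JavaScript, regular expressions with Unicode script properties (such as \p{Latin} in PHP with the u modifier) are one route. The same test can be sketched in Python with the character database, treating a letter as Latin when its official Unicode name says so. This is an illustrative heuristic and the function name is made up.

```python
import unicodedata

def contains_latin(text: str) -> bool:
    """Return True if any letter in text has LATIN in its Unicode name."""
    for ch in text:
        # Digits, spaces and punctuation are ignored; only letters are checked.
        if ch.isalpha() and "LATIN" in unicodedata.name(ch, ""):
            return True
    return False
```

A form validator would then reject input for which contains_latin(value) is true. Note that accented letters such as é also count as Latin under this rule.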


Parsing through Arabic / RTL text from left to right

- by Dan W

Let's say I have a string in an RTL language such as Arabic with some English chucked in: string s = "Test:?????;?????;?????;a;b" Notice there are semicolons in the string. When I use the Split command like string[] spl = s.Split(';');, then some of the strings are saved in reverse order. This is what happens: ??Test:????? ????? ????? a b The above is out of order compared to the original. Instead, I expect to get this: ?Test: ????? ????? ????? a b I'm prepared to write my own split function. However, the chars in the string also parse in reverse order, so I'm back to square one. I just want to go through each character as it's shown on the screen.


PHP: Convert web-page to utf8

- by Paul Tarjan

I would like to only work with UTF8. The problem is I don't know the charset of every webpage. How can I detect it and convert to UTF8?

    <?php
    $url = "http://vkontakte.ru";
    $ch = curl_init($url);
    $options = array(
        CURLOPT_RETURNTRANSFER => true,
    );
    curl_setopt_array($ch, $options);
    $data = curl_exec($ch);
    // $data = magic($data);
    print $data;

See this at: http://paulisageek.com/tmp/curl-utf8

What is magic()?
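
One plausible shape for magic(), sketched in Python rather than PHP: look for a charset declaration in the Content-Type header, then in the first part of the HTML, and assume UTF-8 when neither is present. The function names, the 2048-byte sniffing window and the UTF-8 default are all assumptions made for this example; real pages can declare a charset incorrectly.

```python
import codecs
import re

CHARSET_RE = re.compile(r"charset=[\"']?([\w.:-]+)", re.IGNORECASE)

def sniff_charset(body: bytes, content_type: str = "") -> str:
    # 1. Prefer the HTTP header, 2. then a <meta> declaration near the top.
    match = CHARSET_RE.search(content_type)
    if not match:
        head = body[:2048].decode("ascii", errors="ignore")
        match = CHARSET_RE.search(head)
    if match:
        try:
            return codecs.lookup(match.group(1)).name
        except LookupError:
            pass  # unknown label: fall through to the default
    return "utf-8"

def to_utf8(body: bytes, content_type: str = "") -> bytes:
    charset = sniff_charset(body, content_type)
    # errors="replace" keeps the conversion from failing on stray bytes.
    return body.decode(charset, errors="replace").encode("utf-8")

# Hypothetical page declaring windows-1251 in a meta tag.
page = '<meta charset="windows-1251"><p>Привет</p>'.encode("cp1251")
converted = to_utf8(page)
```

In PHP the same steps would typically end in a call to an existing conversion function such as iconv or mb_convert_encoding.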


What is the difference between _tmain() and main() in C++?

- by joshcomley

If I run my C++ application with the following main() method everything is OK:

    int main(int argc, char *argv[])
    {
        cout << "There are " << argc << " arguments:" << endl;
        // Loop through each argument and print its number and value
        for (int i=0; i<argc; i++)
            cout << i << " " << argv[i] << endl;
        return 0;
    }

I get what I expect and my arguments are printed out. However, if I use _tmain:

    int _tmain(int argc, char *argv[])
    {
        cout << "There are " << argc << " arguments:" << endl;
        // Loop through each argument and print its number and value
        for (int i=0; i<argc; i++)
            cout << i << " " << argv[i] << endl;
        return 0;
    }

It just displays the first character of each argument. What is the difference causing this?
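
The "first character only" symptom fits a wide-character string being read as a narrow one: in a Unicode build _tmain maps to wmain, so the arguments are UTF-16 even though the code declares them as char *. For ASCII text every second byte of UTF-16 is zero, and a char * reader stops at the first zero. A small Python sketch of that byte layout (the helper name is invented, and this only emulates what a C string reader does):

```python
def as_c_string(buffer: bytes) -> bytes:
    # Emulate a char* consumer: read up to, but not including, the first NUL byte.
    return buffer.split(b"\x00", 1)[0]

wide = "hello".encode("utf-16-le")  # how Windows lays out a wchar_t string
seen_by_narrow_code = as_c_string(wide)
```

Declaring the parameter as _TCHAR *argv[] and printing through the matching wide stream avoids the mismatch.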


convert ü to u

- by Remus Rigo

Hi all, I'm using a database that contains contacts (fields like name, address, ...). If my database contains a city with special chars (like ü) or HTML codes (like &uuml;), how can I convert them to u, so that when I search for a city containing such a special char it is shown in the result?
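
A widely used technique for this kind of accent-insensitive search is Unicode normalization: decompose each character (NFKD), drop the combining marks, and compare the folded forms. The sketch below also decodes HTML entities first. The function name is illustrative, and the database side (storing a folded column next to the original) is left as an assumption.

```python
import html
import unicodedata

def fold(text: str) -> str:
    """Return a search key: entities decoded, accents removed, case folded."""
    text = html.unescape(text)                      # "&uuml;" -> "ü"
    decomposed = unicodedata.normalize("NFKD", text)  # "ü" -> "u" + combining diaeresis
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()
```

Searching then compares fold(query) with fold(city). Letters that have no decomposition, such as ø, are left unchanged by this approach.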


F5 Networks iRule/Tcl - Escaping UNICODE 6-character escape sequences so they are processed as and r

- by openid.malcolmgin.com

We are trying to get an F5 BIG-IP LTM iRule working properly with SharePoint 2007 in an SSL termination role. This architecture offloads all of the SSL processing to the F5 and the F5 forwards interactive requests/responses to the SharePoint front end servers via HTTP only (over a secure network). For the purposes of this discussion, iRules are parsed by a Tcl interpretation engine on the F5 Networks BIG-IP device.

As such, the F5 does two things to traffic passing through it:

1. Redirects any request to port 80 (HTTP) to port 443 (HTTPS) through HTTP 302 redirects and URL rewriting.
2. Rewrites any response to the browser to selectively rewrite URLs embedded within the HTML so that they go to port 443 (HTTPS). This prevents the 302 redirects from breaking DHTML generated by SharePoint.

We've got part 1 working fine. The main problem with part 2 is that in the response rewrite, because of XML namespaces and other similar issues, not ALL matches for "http:" can be changed to "https:". Some have to remain "http:". Additionally, some of the "http:" URLs are difficult in that they live in SharePoint-generated JavaScript and their slashes (i.e. "/") are actually represented in the HTML by the UNICODE 6-character string, "\u002f".

For example, in the case of these tricky ones, the literal string in the outgoing HTML is:

    http:\u002f\u002fservername.company.com\u002f

And should be changed to:

    https:\u002f\u002fservername.company.com\u002f

Currently we can't even figure out how to get a match in a search/replace expression on these UNICODE sequence string literals. It seems that no matter how we slice it, the Tcl interpreter is interpreting the "\u002f" string into the "/" translation before it does anything else. We've tried various combinations of Tcl escaping methods we know about (mainly double-quotes and using an extra "\" to escape the "\" in the UNICODE string) but are looking for more methods, preferably ones that work.

Does anyone have any ideas or any pointers to where we can effectively self-educate about this? Thanks very much in advance.


What are ways to prevent files with the Right-to-Left Override Unicode character in their name (a malware spoofing method) from being written or read?

- by galacticninja

What are ways to avoid or prevent files with the RLO (Right-to-Left Override) Unicode character in their name (a malware method to spoof filenames) from being written or read on a Windows PC?

More info on the RLO Unicode character here:

http://www.fileformat.info/info/unicode/char/202e/index.htm
http://en.wikipedia.org/wiki/Bi-directional_text

Info on the RLO Unicode character when used by malware:

http://www.ipa.jp/security/english/virus/press/201110/E_PR201110.html
Mirror link: http://webcache.googleusercontent.com/search?q=cache:KasmfOvbVJ8J:www.ipa.jp/security/english/virus/press/201110/E_PR201110.html+&cd=1&hl=en&ct=clnk

You can try this RLO character test webpage: http://www.fileformat.info/info/unicode/char/202e/browsertest.htm The RLO character is also already pasted in the 'Input Test' field on that webpage. Try typing there and notice that the characters you're typing come out in reverse order (right-to-left, instead of left-to-right).

In filenames, the RLO character can be specifically positioned to spoof or masquerade as a filename or file extension different from the actual one. (It will still be hidden even if 'Hide extensions for known filetypes' is unchecked.)

The only info I can find on how to prevent files with the RLO character from being run is from the Information Technology Promotion Agency, Japan website: http://www.ipa.jp/security/english/virus/press/201110/E_PR201110.html (Mirror link). They advised using the Local Security Policy settings manager to block files with the RLO character in their name from being run.

Can anyone recommend any other good solutions to prevent files with the RLO character in their names from being written or read on the computer, or a way to alert the user if a file with the RLO character is detected?

My OS is Windows 7, but I'll be looking for solutions for Windows XP, Vista and 7, or a solution that will work for all those OSes, to help people using those OSes too.
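
For the "alert the user" part of the question, detection is straightforward to sketch: walk a directory tree and flag any file name containing U+202E or another bidirectional control character. This is only an illustrative scanner with made-up function names; it detects such files after the fact and does not prevent them from being written or run.

```python
import os

# U+202E is the RLO character; the others are related embedding,
# override and isolate controls that can also reorder displayed text.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}

def has_bidi_control(name: str) -> bool:
    return any(ch in BIDI_CONTROLS for ch in name)

def find_suspicious(root: str):
    """Yield paths under root whose file name contains a bidi control character."""
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if has_bidi_control(name):
                yield os.path.join(folder, name)
```

A name such as "invoice" + U+202E + "gpj.exe" is displayed with a tail that looks like "exe.jpg" and would be flagged by has_bidi_control.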


Can I convert an ASCII MD5 hashed password into a Unicode MD5 hashed password?

- by Jimmy Moo Moo

Hello, I'm looking for help converting an ASCII MD5 hashed password into a Unicode MD5 hashed password. For example, I'll use the string "password". When it's converted to an ASCII byte array, I get a base64-encoded hash of X03MO1qnZdYdgyfeuILPmQ==. When it's converted into a Unicode byte array, I get a base64-encoded hash of sIHb6F4ew//D1OfQInQAzQ==.

All my passwords are stored as an MD5 hash that was applied to an ASCII byte array, but I'm trying to migrate my application's user data to a system that stores passwords as an MD5 hash applied to a Unicode byte array. In case it's not clear, consider the following C# code:

    var passwordBytes = Encoding.ASCII.GetBytes("password");
    var hashAlgorithm = HashAlgorithm.Create("MD5");
    var hashBytes = hashAlgorithm.ComputeHash(passwordBytes);

My current system uses this, but the system I'm moving to differs in the first line: it uses Encoding.Unicode.GetBytes. Does anybody know how I can convert my passwords from X03MO1qnZdYdgyfeuILPmQ== into sIHb6F4ew//D1OfQInQAzQ==? I'm guessing the answer is that I can't, since the encoding is done before the hashing, but I thought I'd ask the bright minds of stackoverflow and see if anybody has a way.
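
The guess in the question matches how hashing works: the two digests come from different byte sequences, and a hash cannot be reversed to re-encode its input. The Python sketch below reproduces the two computations, assuming that .NET's Encoding.Unicode corresponds to UTF-16 little-endian; the helper name is made up.

```python
import base64
import hashlib

def md5_b64(password: str, encoding: str) -> str:
    # Encode first, then hash: the digest depends on the exact bytes.
    digest = hashlib.md5(password.encode(encoding)).digest()
    return base64.b64encode(digest).decode("ascii")

ascii_hash = md5_b64("password", "ascii")        # 8 bytes hashed
unicode_hash = md5_b64("password", "utf-16-le")  # 16 bytes hashed; the question reports sIHb6F4ew//D1OfQInQAzQ==
```

Since the stored digest cannot be translated, a migration usually has to recompute the new hash at a moment when the plain password is available, for example at the user's next login.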


Stream/string/bytearray transformations in Python 3

- by Craig McQueen

Python 3 cleans up Python's handling of Unicode strings. I assume as part of this effort, the codecs in Python 3 have become more restrictive, according to the Python 3 documentation compared to the Python 2 documentation. For example, codecs that conceptually convert a bytestream to a different form of bytestream have been removed:

- base64_codec
- bz2_codec
- hex_codec

And codecs that conceptually convert Unicode to a different form of Unicode have also been removed (in Python 2 it actually went between Unicode and bytestream, but conceptually it's really Unicode to Unicode, I reckon):

- rot_13

My main question is, what is the "right way" in Python 3 to do what these removed codecs used to do? They're not codecs in the strict sense, but "transformations". But the interface and implementation would be very similar to codecs. I don't care about rot_13, but I'm interested to know what would be the "best way" to implement a transformation of line ending styles (Unix line endings vs Windows line endings), which should really be a Unicode-to-Unicode transformation done before encoding to a byte stream, especially when UTF-16 is being used, as discussed in this other SO question.
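
In current Python 3 the byte-to-byte transformations are available through dedicated standard-library modules, and rot_13 is reachable through codecs.encode as a text-to-text transform. The sketch below shows those calls, plus a line-ending conversion done on text before encoding. The helper to_windows_newlines is a made-up name, not a standard function.

```python
import base64
import binascii
import bz2
import codecs

payload = b"hello"

b64 = base64.b64encode(payload)         # bytes -> bytes
hexed = binascii.hexlify(payload)       # bytes -> bytes
packed = bz2.compress(payload)          # bytes -> bytes
rot = codecs.encode("hello", "rot_13")  # str -> str

def to_windows_newlines(text: str) -> str:
    # Normalise first so existing "\r\n" pairs are not doubled.
    return text.replace("\r\n", "\n").replace("\n", "\r\n")

# Convert line endings on the text, then encode; this keeps UTF-16 output valid.
encoded = to_windows_newlines("a\nb").encode("utf-16-le")
```

Doing the newline change before encoding matters for UTF-16, where a byte-level replacement of b"\n" would split two-byte code units.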


Social media and special characters

- by John Paul Cook

I’ve previously blogged about using Unicode with T-SQL to put superscripts, subscripts, and special characters into text strings. Unicode is also useful in formatting social media such as Facebook, Twitter, and that dinosaur otherwise known as email. When you can’t set properties of text such as italicizing the subject line of an email message or adding subscripts to a Facebook post, Unicode can make it possible. There are Unicode characters that are intrinsically italicized. Others are intrinsically...(read more)


MFC: what would be the regex to check if a character is unicode or not?

- by Owen

Hi all, I'm trying to use the Windows API function IsTextUnicode to check whether a character input is Unicode or not, but it is sort of buggy. I figured it might be better to use a regex. However, I'm new to constructing regular expressions. What would be the regex to check if a character is Unicode or not? Thanks...

Search Results

Search found 1649 results on 66 pages for 'unicode normalization'.
