Search Results

Search found 1474 results on 59 pages for 'unicode'.

Page 5/59 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12  | Next Page >

  • Track unicode words from Twitter using Ruby and the Tweetstream API

    - by Régis B.
    I am trying to track a set of keywords from Twitter by using the Streaming API (can't post the link here because of spam limitations: google twitter streaming API). I am doing this inside Ruby, using the TweetStream gem: http://bit.ly/cODAWI The problem I have is that I want to track keywords that contain some unicode/UTF-8 characters. For instance: require 'rubygems' require 'tweetstream' TweetStream::Client.new("my_user_name", "my_password").track("é") do |s| puts s.text end (you can try it out, provided you installed the tweetstream and json gems) This piece of code does not print anything, while replacing "é" with "e" outputs a bunch of tweets continuously. I did not find any reliable documentation about Unicode in Ruby, so I have no idea where the problem comes from. Thanks for your help!

    Read the article

  • Unicode data from NSData to NSString

    - by Jeff
    So if I have NSData from an HTTP request, then I do something like this: NSString *test = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding]; This will result in null if the data contains weird unicode data (title is from reddit): {"title":"click..¦¦me..and..then¦¦________ ¦¦check¦¦_.your...¦¦.__...¦¦____ ¦¦....¦¦¦¦¦¦¦¦¦¦¦¦¦¦....¦¦____ ¦¦¦¦¦¦....¦¦¦¦¦¦....¦¦¦¦¦¦____ ¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦____ ....¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦______ ........¦¦..._recently....¦¦________ ....¦¦....viewed....links....¦¦_____"}, How would I convert the data to a string? Ideally, it would best if the string wasn't null so I could parse it as JSON, but even a lossy conversion is fine with me in these cases. I'm not familiar with unicode (naive American I am), so any enlightenment about that would be a nice bonus :)

    Read the article

  • Unicode and URI encoding, decoding and escaping in JavaScript

    - by apphacker
    If you look at this table here, it has a list of escape sequences for Unicode characters that don't actually work for me. For example for "%96", which should be a –, I get an error when trying decode: decodeURIComponent("%96"); URIError: URI malformed If I attempt to encode "–" I actually get: encodeURIComponent("–"); "%E2%80%93" I searched through the internet and I saw this page, which mentions using escape and unescape with decodeURIComponent and encodeURIComponent respectively. This doesn't seem to help because %96 doesn't show up as "–" no matter what I try and this of course wouldn't work: decodeURIComponent(escape("%96)); "%96" Not very helpful. How can I get "%96" to be a "–" with JavaScript (without hardcoding a map for every single possible unicode character I may run into)?

    Read the article

  • Regular expressions in python unicode

    - by Remy
    I need to remove all the html tags from a given webpage data. I tried this using regular expressions: import urllib2 import re page = urllib2.urlopen("http://www.frugalrules.com") from bs4 import BeautifulSoup, NavigableString, Comment soup = BeautifulSoup(page) link = soup.find('link', type='application/rss+xml') print link['href'] rss = urllib2.urlopen(link['href']).read() souprss = BeautifulSoup(rss) description_tag = souprss.find_all('description') content_tag = souprss.find_all('content:encoded') print re.sub('<[^>]*>', '', content_tag) But the syntax of the re.sub is: re.sub(pattern, repl, string, count=0) So, I modified the code as (instead of the print statement above): for row in content_tag: print re.sub(ur"<[^>]*>",'',row,re.UNICODE But it gives the following error: Traceback (most recent call last): File "C:\beautifulsoup4-4.3.2\collocation.py", line 20, in <module> print re.sub(ur"<[^>]*>",'',row,re.UNICODE) File "C:\Python27\lib\re.py", line 151, in sub return _compile(pattern, flags).sub(repl, string, count) TypeError: expected string or buffer What am I doing wrong?

    Read the article

  • Output Unicode to Console Using C++

    - by Jesse Foley
    I'm still learning C++, so bear with me and my sloppy code. The compiler I use is Dev C++. I want to be able to output Unicode characters to the Console using cout. Whenver i try things like: # #include directive here (include iostream) using namespace std; int main() { cout << "Hello World!\n"; cout << "Blah blah blah some gibberish unicode: ÐAßGg\n"; system("PAUSE"); return 0; } It outputs strange characters to the console, like µA¦Gg. Why does it do that, and how can i get to to display ÐAßGg? Or is this not possible with Windows?

    Read the article

  • What .NET UnmanagedType is Unicode (UTF-16)?

    - by Pat
    I am packing bytes into a struct, and some of them correspond to a Unicode string. The following works fine for an ASCII string: [StructLayout(LayoutKind.Sequential)] private struct PacketBytes { [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 64)] public string MyString; } I assumed that I could do [StructLayout(LayoutKind.Sequential)] private struct PacketBytes { [MarshalAs(UnmanagedType.LPWStr, SizeConst = 32)] public string MyString; } to make it Unicode, but that didn't work. (Since this field is part of a struct with other fields, which I've omitted for clarity, I can't simply change the CharSet of the containing struct.) Any idea what I'm doing wrong?

    Read the article

  • IE cannot download file with unicode pathname

    - by MM
    I have a web-app that allows users to upload and download image files by pressing buttons on a web page. A user of this page is reporting that IE 7 and 8 fail to download files when the files have Unicode pathnames. IE prompts the user with a dialog stating: "Internet explorer cannot download (file) at (webserver).". Unfortunately I have not been able to reproduce the problem using these versions on my machine. My question is, what could cause this, and how can I prevent it from happening? I have read about problems with cache control (I currently have it set to no-cache); however, I am not using HTTP-S, and the problem only occurs with file-names containing Unicode characters.

    Read the article

  • How to convert unicode character to its escaped ascii equivalent in c#

    - by Grant
    Hi, i am beginning with a string containing an encoded unicode character "& #xfc;". I pass the string to an object that performs some logic and returns another string. That string is converting the original encoded character to its unicode equivalent "ü". I need to get the original encoded character back but so far am not able. I have tried using the HttpUtility.HtmlEncode() method but that is returning "& #252;" which is not the same. Can anyone help?

    Read the article

  • c# unicode string output

    - by Reg
    I have function to convert string to a Unicode string: private string UnicodeString(string text) { return Encoding.UTF8.GetString(Encoding.ASCII.GetBytes(text)); } But when I am calling this function the output result is wrong. It looks like my function is not working. Console.WriteLine(UnicodeString("????? ?????")) printing on console just questions like that: ????? ???? Is there any way to say to console to display it correct? UPDATE Looks like the problem not in Unicode, I think may be it is displaying question marks because i am not having correct locale in the system (Windows 7)? Is there any way to make it work without changing locale?

    Read the article

  • Python - pyparsing unicode characters

    - by mgj
    Hi..:) I tried using w = Word(printables), but it isn't working. How should I give the spec for this. 'w' is meant to process Hindi characters (UTF-8) The code specifies the grammar and parses accordingly. 671.assess :: ????? ::2 x=number + "." + src + "::" + w + "::" + number + "." + number If there is only english characters it is working so the code is correct for the ascii format but the code is not working for the unicode format. I mean that the code works when we have something of the form 671.assess :: ahsaas ::2 i.e. it parses words in the english format, but I am not sure how to parse and then print characters in the unicode format. I need this for English Hindi word alignment for purpose. The python code looks like this: # -*- coding: utf-8 -*- from pyparsing import Literal, Word, Optional, nums, alphas, ZeroOrMore, printables , Group , alphas8bit , # grammar src = Word(printables) trans = Word(printables) number = Word(nums) x=number + "." + src + "::" + trans + "::" + number + "." + number #parsing for eng-dict efiledata = open('b1aop_or_not_word.txt').read() eresults = x.parseString(efiledata) edict1 = {} edict2 = {} counter=0 xx=list() for result in eresults: trans=""#translation string ew=""#english word xx=result[0] ew=xx[2] trans=xx[4] edict1 = { ew:trans } edict2.update(edict1) print len(edict2) #no of entries in the english dictionary print "edict2 has been created" print "english dictionary" , edict2 #parsing for hin-dict hfiledata = open('b1aop_or_not_word.txt').read() hresults = x.scanString(hfiledata) hdict1 = {} hdict2 = {} counter=0 for result in hresults: trans=""#translation string hw=""#hin word xx=result[0] hw=xx[2] trans=xx[4] #print trans hdict1 = { trans:hw } hdict2.update(hdict1) print len(hdict2) #no of entries in the hindi dictionary print"hdict2 has been created" print "hindi dictionary" , hdict2 ''' ####################################################################################################################### def translate(d, ow, hinlist): if ow in d.keys():#ow=old word d=dict print ow , "exists in the dictionary keys" transes = d[ow] transes = transes.split() print "possible transes for" , ow , " = ", transes for word in transes: if word in hinlist: print "trans for" , ow , " = ", word return word return None else: print ow , "absent" return None f = open('bidir','w') #lines = ["'\ #5# 10 # and better performance in business in turn benefits consumers . # 0 0 0 0 0 0 0 0 0 0 \ #5# 11 # vHyaapaar mEmn bEhtr kaam upbhOkHtaaomn kE lIe laabhpHrdd hOtaa hAI . # 0 0 0 0 0 0 0 0 0 0 0 \ #'"] data=open('bi_full_2','rb').read() lines = data.split('!@#$%') loc=0 for line in lines: eng, hin = [subline.split(' # ') for subline in line.strip('\n').split('\n')] for transdict, source, dest in [(edict2, eng, hin), (hdict2, hin, eng)]: sourcethings = source[2].split() for word in source[1].split(): tl = dest[1].split() otherword = translate(transdict, word, tl) loc = source[1].split().index(word) if otherword is not None: otherword = otherword.strip() print word, ' <-> ', otherword, 'meaning=good' if otherword in dest[1].split(): print word, ' <-> ', otherword, 'trans=good' sourcethings[loc] = str( dest[1].split().index(otherword) + 1) source[2] = ' '.join(sourcethings) eng = ' # '.join(eng) hin = ' # '.join(hin) f.write(eng+'\n'+hin+'\n\n\n') f.close() ''' if an example input sentence for the source file is: 1# 5 # modern markets : confident consumers # 0 0 0 0 0 1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 0 0 0 0 0 0 !@#$% the ouptut would look like this :- 1# 5 # modern markets : confident consumers # 1 2 3 4 5 1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 1 2 3 4 5 0 !@#$% Output Explanation:- This achieves bidirectional alignment. It means the first word of english 'modern' maps to the first word of hindi 'AddhUnIk' and vice versa. Here even characters are take as words as they also are an integral part of bidirectional mapping. Thus if you observe the hindi WORD '.' has a null alignment and it maps to nothing with respect to the English sentence as it doesn't have a full stop. The 3rd line int the output basically represents a delimiter when we are working for a number of sentences for which your trying to achieve bidirectional mapping. What modification should i make for it to work if the I have the hindi sentences in Unicode(UTF-8) format.

    Read the article

  • problem using getline with a unicode file

    - by hamishmcn
    UPDATE: Thank you to @Potatoswatter and @Jonathan Leffler for comments - rather embarrassingly I was caught out by the debugger tool tip not showing the value of a wstring correctly - however it still isn't quite working for me and I have updated the question below: If I have a small multibyte file I want to read into a string I use the following trick - I use getline with a delimeter of '\0' e.g. std::string contents_utf8; std::ifstream inf1("utf8.txt"); getline(inf1, contents_utf8, '\0'); This reads in the entire file including newlines. However if I try to do the same thing with a wide character file it doesn't work - my wstring only reads to the the first line. std::wstring contents_wide; std::wifstream inf2(L"ucs2-be.txt"); getline( inf2, contents_wide, wchar_t(0) ); //doesn't work For example my if unicode file contains the chars A and B seperated by CRLF, the hex looks like this: FE FF 00 41 00 0D 00 0A 00 42 Based on the fact that with a multibyte file getline with '\0' reads the entire file I believed that getline( inf2, contents_wide, wchar_t(0) ) should read in the entire unicode file. However it doesn't - with the example above my wide string would contain the following two wchar_ts: FF FF (If I remove the wchar_t(0) it reads in the first line as expected (ie FE FF 00 41 00 0D 00) Why doesn't wchar_t(0) work as a delimiting wchar_t of "00 00"? Thank you

    Read the article

  • Allowed unicode characters in IDN host labels

    - by Roland Franssen
    Hi all, Im currently working on a "proper" URI validator and currently it all comes down to hostname validation, the rest isnt that tricky. Im stuck at IDN hostname labels (e.g. containing unicode; possible punycode encoded strings have been decoded at this point). My first idea was basicly a regex for TLD's not supporting IDN and one for those who do (http://www.mozilla.org/projects/security/tld-idn-policy-list.html (?)). Respectively; ^[a-zA-Z0-9-]+$ and ^[a-zA-Z0-9-\p{L}]+$ However this is not an ideal situation, since every IDN registrar can decide which characters to allow and which not. What im looking for is a proper, consistent, up2date data table of unicode characters allowed in various TLD's; im getting this idea i have to find all the data myself at russian and chinese registry sites (which is quite difficult). So before spitting down the web.. i wondered is there such a list? Or are there better approaches, best/common practices etc? (I want the validation to be as strict as possible.) Any help is welcome! // Roland

    Read the article

  • iPhone app rejection for using ICU (Unicode extensions)

    - by nickbit
    I received the following mail form Apple, considering my application: *Thank you for submitting your update to ??µ??es?a to the App Store. During our review of your application we found it is using private APIs, which is in violation of the iPhone Developer Program License Agreement section 3.3.1; "3.3.1 Applications may only use Documented APIs in the manner prescribed by Apple and must not use or call any private APIs." While your application has not been rejected, it would be appropriate to resolve this issue in your next update. The following non-public APIs are included in your application: u_isspace ubrk_close ubrk_current ubrk_first ubrk_next ubrk_open If you have defined methods in your source code with the same names as the above mentioned APIs, we suggest altering your method names so that they no longer collide with Apple's private APIs to avoid your application being flagged with future submissions. Please resolve this issue in your next update to ??µ??es?a. Sincerely, iPhone App Review Team* The functions mentioned in this mail are used in the ICU library (International Components for Unicode). Although my app is not rejected at this point, I don't feel very secure for the future of my app, because it relies heavily on the Unicode protocol and on this components in particular. Another thing is that I do not call these functions directly, but they are called by a custom 'sqlite' build (with FTS3 extensions enabled). Am I missing something here? Any suggestions?

    Read the article

  • latin1/unicode conversion problem with ajax request and special characters

    - by mfn
    Server is PHP5 and HTML charset is latin1 (iso-8859-1). With regular form POST requests, there's no problem with "special" characters like the em dash (–) for example. Although I don't know for sure, it works. Probably because there exists a representable character for the browser at char code 150 (which is what I see in PHP on the server for a literal em dash with ord). Now our application also provides some kind of preview mechanism via ajax: the text is sent to the server and a complete HTML for a preview is sent back. However, the ordinary char code 150 em dash character when sent via ajax (tested with GET and POST) mutates into something more: %E2%80%93. I see this already in the apache log. According to various sources I found, e.g. http://www.tachyonsoft.com/uc0020.htm , this is the UTF8 byte representation of em dash and my current knowledge is that JavaScript handles everything in Unicode. However within my app, I need everything in latin1. Simply said: just like a regular POST request would have given me that em dash as char code 150, I would need that for the translated UTF8 representation too. That's were I'm failing, because with PHP on the server when I try to decode it with either utf8_decode(...) or iconv('UTF-8', 'iso-8859-1', ...) but in both cases I get a regular ? representing this character (and iconv also throws me a notice: Detected an illegal character in input string ). My goal is to find an automated solution, but maybe I'm trying to be überclever in this case? I've found other people simply doing manual replacing with a predefined input/output set; but that would always give me the feeling I could loose characters. The observant reader will note that I'm behind on understanding the full impact/complexity with things about Unicode and conversion of chars and I definitely prefer to understand the thing as a whole then a simply manual mapping. thanks

    Read the article

  • Reading Unicode files line by line C++

    - by Roger Nelson
    What is the correct way to read Unicode files line by line in C++? I am trying to read a file saved as Unicode (LE) by Windows Notepad. Suppose the file contains simply the characters A and B on separate lines. In reading the file byte by byte, I see the following byte sequence (hex) : FE FF 41 00 0D 00 0A 00 42 00 0D 00 0A 00 So 2 byte BOM, 2 byte 'A', 2byte CR , 2byte LF, 2 byte 'B', 2 byte CR, 2 byte LF . I tried reading the text file using the following code: std::wifstream file("test.txt"); file.seekg(2); // skip BOM std::wstring A_line; std::wstring B_line; getline(file,A_line); // I get "A" getline(file,B_line); // I get "\0B" I get the same results using operator instead of getline file >> A_line; file >> B_line; It appears that the single byte CR character is is being consumed only as the single byte. or CR NULL LF is being consumed but not the high byte NULL. I would expect wifstream in text mode would read the 2byte CR and 2byte LF. What am I doing wrong? It does not seem right that one should have to read a text file byte by byte in binary mode just to parse the new lines.

    Read the article

  • Using JavaMail to send a mail containing Unicode characters

    - by NoozNooz42
    I'm successfully sending emails through GMail's SMTP servers using the following piece of code: Properties props = new Properties(); props.put("mail.smtp.host", "smtp.gmail.com"); props.put("mail.smtp.socketFactory.port", "465"); props.put("mail.smtp.socketFactory.class","javax.net.ssl.SSLSocketFactory"); props.put("mail.smtp.auth", "true"); props.put("mail.smtp.port", "465"); props.put("mail.smtp.ssl", "true"); props.put("mail.smtp.starttls.enable","true"); props.put("mail.smtp.timeout", "5000"); props.put("mail.smtp.connectiontimeout", "5000"); // Do NOT use Session.getDefaultInstance but Session.getInstance // See: http://forums.sun.com/thread.jspa?threadID=5301696 final Session session = Session.getInstance( props, new javax.mail.Authenticator() { protected PasswordAuthentication getPasswordAuthentication() { return new PasswordAuthentication( USER, PWD ); } }); try { final Message message = new MimeMessage(session); message.setFrom( new InternetAddress( USER ) ); message.setRecipients( Message.RecipientType.TO, InternetAddress.parse( TO ) ); message.setSubject( emailSubject ); message.setText( emailContent ); Transport.send(message); emailSent = true; } catch ( final MessagingException e ) { e.printStackTrace(); } where emailContent is a String that does contain Unicode characters (like the euro symbol). When the email arrives (in another GMail account), the euro symbol has been converted to the ASCII '?' question mark. I don't know much about emails: can email use any character encoding? What should I modify in the code above so that an encoding allowing Unicode characters is used?

    Read the article

  • Cross-platform iteration of Unicode string

    - by kizzx2
    I want to iterate each character of a Unicode string, treating each surrogate pair and combining character sequence as a single unit (one grapheme). Example The text "??????" is comprised of the code points: U+0928, U+092E, U+0938, U+094D, U+0924, U+0947, of which, U+0938 and U+0947 are combining marks. static void Main(string[] args) { const string s = "??????"; Console.WriteLine(s.Length); // Ouptuts "6" var l = 0; var e = System.Globalization.StringInfo.GetTextElementEnumerator(s); while(e.MoveNext()) l++; Console.WriteLine(l); // Outputs "4" } So there we have it in .NET. We also have Win32's CharNextW() #include <Windows.h> #include <iostream> #include <string> int main() { const wchar_t * s = L"??????"; std::cout << std::wstring(s).length() << std::endl; // Gives "6" int l = 0; while(CharNextW(s) != s) { s = CharNextW(s); ++l; } std::cout << l << std::endl; // Gives "4" return 0; } Question Both ways I know of are specific to Microsoft. Are there portable ways to do it? I heard about ICU but I couldn't find something related quickly (UnicodeString(s).length() still gives 6). Would be an acceptable answer to point to the related function/module in ICU. C++ doesn't have a notion of Unicode, so a lightweight cross-platform library for dealing with these issues would make an acceptable answer.

    Read the article

  • Convert Unicode char to closest (most similar) char in ASCII (.net)

    - by Andrey
    Hi all! Do you have any idea how to covert different Unicode characters to their closest ASCII equivalents? Like Ä - A. A googled but didn't find any suitable solution. Trick Encoding.ASCII.GetBytes("Ä")[0] didn't work. (Result was ?). I found that there is class Encoder that has Fallback property that is exactly for cases when char can't be converted, but implementations (EncoderReplacementFallback) are stupid and convert to ?. Any ideas? Thanks, Andrey

    Read the article

  • SHA-1 and Unicode

    - by Andrew
    Hi everyone, Is behavior of SHA-1 algorithm defined for Unicode strings? I do realize that SHA-1 itself does not care about the content of the string, however, it seems to me that in order to pass standard tests for SHA-1, the input string should be encoded with UTF-8.

    Read the article

  • Insert unicode strings into CleverCSS

    - by Brian M. Hunt
    How can one insert a Unicode string CSS into CleverCSS? In particular, how could one produce the following CSS using CleverCSS: li:after { content: "\00BB \0020"; } I've figured out CleverCSS's parsing rules, but suffice that the permutations I've thought sensible have failed, for example: li: content: "\\00BB \\0020" // becomes content: 'BB 0' EDIT: My other examples and the rest of my post weren't saved. Suffice that I had a longer list of examples that also failed, as did my closing which was something like: I'd be grateful for any thoughts and input. Brian

    Read the article

  • Need a simple tool for analysing unicode characters

    - by Steve Bennett
    I'm surprised I can't find a simple tool for this. Basically, sometimes as a result of text munging, or using some piece of software, I end up with some text that has some troublesome characters - such as looking a lot like other characters, but being distinct from them. I'd like a tool (preferably online, javascript based) where I can paste the text, and it will tell me all the characters involved, their names, unicode codes etc.

    Read the article

  • Convert or strip out "illegal" Unicode characters

    - by Oli
    I've got a database in MSSQL that I'm porting to SQLite/Django. I'm using pymssql to connect to the database and save a text field to the local SQLite database. However for some characters, it explodes. I get complaints like this: UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1916: ordinal not in range(128) Is there some way I can convert the chars to proper unicode versions? Or strip them out?

    Read the article

  • lxml unicode entity parse problems

    - by Jon Hadley
    I'm using lxml as follows to parse an exported XML file from another system: xmldoc = open(filename) etree.parse(xmldoc) But im getting: lxml.etree.XMLSyntaxError: Entity 'eacute' not defined, line 4495, column 46 Obviously it's having problems with unicode entity names - but how would i get round this? Via open() or parse()?

    Read the article

< Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12  | Next Page >