Search Results

Search found 1649 results on 66 pages for 'unicode normalization'.

Page 14/66 | < Previous Page | 10 11 12 13 14 15 16 17 18 19 20 21  | Next Page >

  • getTextContent from Node with whitespace character normalization

    - by Nayn
    Hi, I am working with XPATH, Java and want to extract some text out of one html page. The text is located under some div with some whitespace characters in between, like &nbsp; <br> etc. I want these to be converted into 'space' and 'newline' respectively while extracting. The method I am using to extract text is Element.getTextContent() which does not respect whitespace characters. Could somebody tell me if there is a way to extract text with whitespace normalization OR Extract whole html markup under the 'Node' so that i could replace it by myself. Thanks Nayn

    Read the article

  • Manipulating both unicode and ASCII character set in C#

    - by Murlex
    I have this mapping in my C# application string [,] unicode2Ascii = { { "&#3001;", "\x86" } }; ஹ - is the unicode value for a tamil literal "ஹ". This is the raw hex literal for the unicode value saved by MS Word as a byte sequence. I am trying to map these unicode value "strings" to a hex value under 255 (so as to accommodate non-unicode supported systems). I trying to use string.replace like this: S = S.replace(unicode2Ascii[0,0], unicode2Ascii[0,1]); However the resultant ouput has a ? instead of the actual hex 0x86 stored. Any pointer on how I could set the encoding for the second element of that array to something like windows-1252? Or is there a better way to do this conversion? thanks in advance

    Read the article

  • jquery post work with hebrew (unicode) but not with spaces, get not working with hebrew but does spa

    - by Y.G.J
    i have tried load and hebrew didn't work for me so i changed my code to $.ajax({ type: "post", url: "process-tb.asp", data: data, success: function(msg) (partial code) not knowing that post and get is the problem for my hebrew querystring. so know i can get my page to get the hebrew and english but no spaces are add to the text. all pages are encoded to utf-8. what is wrong with it?

    Read the article

  • Delphi 10, .NET, how do I convert a hex UTF-8 string to its unicode character?

    - by Evan V.
    Hi all, I am trying to make my web app compatible with international languages and I am stuck with trying to convert escaped characters in my Delphi .NET DLL. The front end code is passing the UTF-8 hex notation with an escape character e.g for ? I pass \uE3818A. In my DLL I capture this and constract the following string '$E3828A'. I need to convert this back to ? and send it to my database, I've been trying to use Encoding.UTF8.GetBytes and Encoding.UTF8.GetString but with no luck. Anyone could help me figure this out? Thank you.

    Read the article

  • How to use unicode inside an xpath string? (UnicodeEncodeError)

    - by Gj
    I'm using xpath in Selenium RC via the Python api. I need to click an a element who's text is "Submit »" Here's the error that I'm getting: In [18]: sel.click(u"xpath=//a[text()='Submit \xbb')]") ERROR: An unexpected error occurred while tokenizing input The following traceback may be corrupted or invalid The error message is: ('EOF in multi-line statement', (1121, 0)) --------------------------------------------------------------------------- Exception Traceback (most recent call last) /Users/me/<ipython console> in <module>() /Users/me/selenium.pyc in click(self, locator) 282 'locator' is an element locator 283 """ --> 284 self.do_command("click", [locator,]) 285 286 /Users/me/selenium.pyc in do_command(self, verb, args) 213 #print "Selenium Result: " + repr(data) + "\n\n" 214 if (not data.startswith('OK')): --> 215 raise Exception, data 216 return data 217 <type 'str'>: (<type 'exceptions.UnicodeEncodeError'>, UnicodeEncodeError('ascii', u"ERROR: Invalid xpath [2]: //a[text()='Submit \xbb')]", 45, 46, 'ordinal not in range(128)'))

    Read the article

  • How to use Unicode characters in a vim script?

    - by Thomas
    I'm trying to get vim to display my tabs as ? so they cannot be mistaken for actual characters. I'd hoped the following would work: if has("multi_byte") set lcs=tab:? else set lcs=tab:>- endif However, this gives me E474: Invalid argument: lcs=tab:? The file is UTF-8 encoded and includes a BOM. Googling "vim encoding" or similar gives me many results about the encoding of edited files, but nothing about the encoding of executed scripts. How to get this character into my .vimrc so that it is properly displayed?

    Read the article

  • What is the proper way to URL encode Unicode characters?

    - by Josh Gibson
    I know of the non-standard %uxxxx scheme but that doesn't seem like a wise choice since the scheme has been rejected by the W3C. Some interesting examples: The heart character. If I type this into my browser: http://www.google.com/search?q=? Then copy and paste it, I see this URL http://www.google.com/search?q=%E2%99%A5 which makes it seem like Firefox (or Safari) is doing this. urllib.quote_plus(x.encode("latin-1")) '%E2%99%A5' which makes sense, except for things that can't be encoded in Latin-1, like the triple dot character. … If I type the URL http://www.google.com/search?q=… into my browser then copy and paste, I get http://www.google.com/search?q=%E2%80%A6 back. Which seems to be the result of doing urllib.quote_plus(x.encode("utf-8")) which makes sense since … can't be encoded with Latin-1. But then its not clear to me how the browser knows whether to decode with UTF-8 or Latin-1. Since this seems to be ambiguous: In [67]: u"…".encode('utf-8').decode('latin-1') Out[67]: u'\xc3\xa2\xc2\x80\xc2\xa6' works, so I don't know how the browser figures out whether to decode that with UTF-8 or Latin-1. What's the right thing to be doing with the special characters I need to deal with?

    Read the article

  • Unicode special characters - Dingbats - Appear differently in Firefox vs. Chrome/IE

    - by Oren
    Hi there, I'm trying to find a way to make dingbats appear exactly the same in Firefox, Chrome, Safari and IE. I noticed that the Dingbats appear the same in IE/Chrome/Safari, HOWEVER - in Firefox - they look "thiner". For example - try to visit the following page: http://en.wikipedia.org/wiki/Dingbat You'll notice that when viewing that page in Firefox - the characters look different in comparison to Chrome/IE. Does anybody know why and how can I cause Firefox to display the characters EXACTLY like they appear in Chrome/IE? Thx, Oren.

    Read the article

  • Is it possible to reliably auto-decode user files to Unicode? [C#]

    - by NVRAM
    I have a web application that allows users to upload their content for processing. The processing engine expects UTF8 (and I'm composing XML from multiple users' files), so I need to ensure that I can properly decode the uploaded files. Since I'd be surprised if any of my users knew their files even were encoded, I have very little hope they'd be able to correctly specify the encoding (decoder) to use. And so, my application is left with task of detecting before decoding. This seems like such a universal problem, I'm surprised not to find either a framework capability or general recipe for the solution. Can it be I'm not searching with meaningful search terms? I've implemented BOM-aware detection (http://en.wikipedia.org/wiki/Byte_order_mark) but I'm not sure how often files will be uploaded w/o a BOM to indicate encoding, and this isn't useful for most non-UTF files. My questions boil down to: Is BOM-aware detection sufficient for the vast majority of files? In the case where BOM-detection fails, is it possible to try different decoders and determine if they are "valid"? (My attempts indicate the answer is "no.") Under what circumstances will a "valid" file fail with the C# encoder/decoder framework? Is there a repository anywhere that has a multitude of files with various encodings to use for testing? While I'm specifically asking about C#/.NET, I'd like to know the answer for Java, Python and other languages for the next time I have to do this. So far I've found: A "valid" UTF-16 file with Ctrl-S characters has caused encoding to UTF-8 to throw an exception (Illegal character?) (That was an XML encoding exception.) Decoding a valid UTF-16 file with UTF-8 succeeds but gives text with null characters. Huh? Currently, I only expect UTF-8, UTF-16 and probably ISO-8859-1 files, but I want the solution to be extensible if possible. My existing set of input files isn't nearly broad enough to uncover all the problems that will occur with live files. Although the files I'm trying to decode are "text" I think they are often created w/methods that leave garbage characters in the files. Hence "valid" files may not be "pure". Oh joy. Thanks.

    Read the article

  • Is it safe to use random Unicode for complex delimiter sequences in strings?

    - by ccomet
    Question: In terms of program stability and ensuring that the system will actually operate, how safe is it to use chars like ¦, § or ‡ for complex delimiter sequences in strings? Can I reliable believe that I won't run into any issues in a program reading these incorrectly? I am working in a system, using C# code, in which I have to store a fairly complex set of information within a single string. The readability of this string is only necessary on the computer side, end-users should only ever see the information after it has been parsed by the appropriate methods. Because some of the data in these strings will be collections of variable size, I use different delimiters to identify what parts of the string correspond to a certain tier of organization. There are enough cases that the standard sets of ;, |, and similar ilk have been exhausted. I considered two-char delimiters, like ;# or ;|, but I felt that it would be very inefficient. There probably isn't that large of a performance difference in storing with one char versus two chars, but when I have the option of picking the smaller option, it just feels wrong to pick the larger one. So finally, I considered using the set of characters like the double dagger and section. They only take up one char, and they are definitely not going to show up in the actual text that I'll be storing, so they won't be confused for anything. But character encoding is finicky. While the visibility to the end user is meaningless (since they, in fact, won't see it), I became recently concerned about how the programs in the system will read it. The string is stored in one database, while a separate program is responsible for both encoding and decoding the string into different object types for the rest of the application to work with. And if something is expected to be written one way, is possibly written another, then maybe the whole system will fail and I can't really let that happen. So is it safe to use these kind of chars for background delimiters?

    Read the article

  • How do I output Unicode characters as a pair of ASCII characters?

    - by ChrisF
    How do I convert (as an example): Señor Coconut Y Su Conjunto - Introducciõn to: Señor Coconut Y Su Conjunto - Introducciõn I've got an app that creates m3u playlists, but when the track filename, artist or title contains non ASCII characters it doesn't get read properly by the music player so the track doesn't get played. I've discovered that if I write the track out as: #EXTINFUTF8:76,Señor Coconut Y Su Conjunto - Introducciõn #EXTINF:76,Señor Coconut Y Su Conjunto - Introducciõn #UTF8:01-Introducciõn.mp3 01-Introducciõn.mp3 Then the music player will read it correctly and play the track. My problem is that I can't find the information I need to be able to do the conversion properly.

    Read the article

  • Does Postgresql varchar count using unicode character length or ASCII character length?

    - by bennylope
    I tried importing a database dump from a SQL file and the insert failed when inserting the string Mér into a field defined as varying(3). I didn't capture the exact error, but it pointed to that specific value with the constraint of varying(3). Given that I considered this unimportant to what I was doing at the time, I just changed the value to Mer, it worked, and I moved on. Is a varying field with its limit taking into account length of the byte string? What really boggles my mind is that this was dumped from another PostgreSQL database. So it doesn't make sense how a constraint could allow the value to be written initially.

    Read the article

  • How to test if a string has a certain unicode char?

    - by Ruben Trancoso
    Supose you have a command line executable that receives arguments. This executalbe is widechar ready and you want to test if one of this arguments starts with an HYPHEN case in which its an option: command -o foo how you could test it inside your code if you don't know the charset been used by the host? Should be not possible to a given console to produce the same HYPHEN representation by another char in the widechar forest? (in such case it would be a wild char :P) int _tmain(int argc, _TCHAR* argv[]) { std::wstring inputFile(argv[1]); if(inputFile->c_str() <is an HYPHEN>) { _tprintf(_T("First argument cannot be an option")); } }

    Read the article

  • Storing and displaying unicode string (??????) using PHP and MySQL

    - by Anirudh Goel
    I have to store hindi text in a MySQL database, fetch it using a PHP script and display it on a webpage. I did the following: I created a database and set its encoding to UTF-8 and also the collation to utf8_bin. I added a varchar field in the table and set it to accept UTF-8 text in the charset property. Then I set about adding data to it. Here I had to copy data from an existing site. The hindi text looks like this: ????????:05:30 I directly copied this text into my database and used the PHP code echo(utf8_encode($string)) to display the data. Upon doing so the browser showed me "??????". When I inserted the UTF equivalent of the text by going to "view source" in the browser, however, ???????? translates into &#2360;&#2370;&#2352;&#2381;&#2351;&#2379;&#2342;&#2351;. If I enter and store &#2360;&#2370;&#2352;&#2381;&#2351;&#2379;&#2342;&#2351; in the database, it converts perfectly. So what I want to know is how I can directly store ???????? into my database and fetch it and display it in my webpage using PHP. Also, can anyone help me understand if there's a script which when I type in ????????, gives me &#2360;&#2370;&#2352;&#2381;&#2351;&#2379;&#2342;&#2351;? Solution Found I wrote the following sample script which worked for me. Hope it helps someone else too <html> <head> <title>Hindi</title></head> <body> <?php include("connection.php"); //simple connection setting $result = mysql_query("SET NAMES utf8"); //the main trick $cmd = "select * from hindi"; $result = mysql_query($cmd); while ($myrow = mysql_fetch_row($result)) { echo ($myrow[0]); } ?> </body> </html> The dump for my database storing hindi utf strings is CREATE TABLE `hindi` ( `data` varchar(1000) character set utf8 collate utf8_bin default NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1; INSERT INTO `hindi` VALUES ('????????'); Now my question is, how did it work without specifying "META" or header info? Thanks!

    Read the article

  • Spelling correction for data normalization in Java

    - by dareios
    I am looking for a Java library to do some initial spell checking / data normalization on user generated text content, imagine the interests entered in a Facebook profile. This text will be tokenized at some point (before or after spell correction, whatever works better) and some of it used as keys to search for (exact match). It would be nice to cut down misspellings and the like to produce more matches. It would be even better if the correction would perform well on tokens longer than just one word, e.g. "trinking coffee" would become "drinking coffee" and not "thinking coffee". I found the following Java libraries for doing spelling correction: JAZZY does not seem to be under active development. Also, the dictionary-distance based approach seems inadequate because of the use of non-standard language in social network profiles and multi-word tokens. APACHE LUCENE seems to have a statistical spell checker that should be much more suited. Question here would how to create a good dictionary? (We are not using Lucene otherwise, so there is no existing index.) Any suggestions are welcome!

    Read the article

  • De-normalization alternative to specific MYSQL problem?

    - by Booker
    I am facing quite a specific optimization problem. I currently have 4 normalized tables of data. Every second, possibly thousands of users will pull down up-to-date info from these tables using AJAX. The thing is that I can predict relatively easily which subset of data they need... The most recent 100 or so entries in those 4 normalized tables. I have been researching de-normalization... but feel that perhaps there is an easier solution. I was thinking that I could somehow every second run one sql query to condense the needed info, store it in a temp cached table and then have all of the user queries just draw from this. This will allow the complex join of 4 tables to only be run once, and then from there the users just need to do a simple lookup from the cached table. I really don't know if this is feasible. Comments on this or any other suggestions would be much appreciated. Thanks!

    Read the article

  • What feature is at play when Ctrl+Shift+Alt+U,E "types" an unprintable hex 000E?

    - by Peter.O
    I tend to use Ctrl+Shift+Alt for my customized system-wide keybindings. When I tried Ctrl+Shift+Alt+U it printed an underscored u and waited for more keyboard input!... Some keys were accepted and some were not... eg. Numbers were accepted and they too were underlined, but only a few keys allowed me to break out. I then tried Ctrl+Shift+Alt+U immediately followed by Ctrl+Shift+Alt+E. This produced an unprintable hex 000E(?) and broke out of the loop... The unprintable character got me thinking that this may be Unicode related. If so, how so? What is happening here? Is this underscored u a trigger for an Input Method Editor? This behaviour occurs: Here (as I type), "gedit", text-edit fields... (but not in the Terminal)... and "gvim" reported "pattern not found"...

    Read the article

< Previous Page | 10 11 12 13 14 15 16 17 18 19 20 21  | Next Page >