gap - Page 25 - Developer IT

What is the fastest cyclic synchronization in Java (ExecutorService vs. CyclicBarrier vs. X)?

- by Alex Dunlop

Which Java synchronization construct is likely to provide the best performance for a concurrent, iterative processing scenario with a fixed number of threads like the one outlined below? After experimenting on my own for a while (using ExecutorService and CyclicBarrier) and being somewhat surprised by the results, I would be grateful for some expert advice and maybe some new ideas. Existing questions here do not seem to focus primarily on performance, hence this new one. Thanks in advance! The core of the app is a simple iterative data processing algorithm, parallelized to the spread the computational load across 8 cores on a Mac Pro, running OS X 10.6 and Java 1.6.0_07. The data to be processed is split into 8 blocks and each block is fed to a Runnable to be executed by one of a fixed number of threads. Parallelizing the algorithm was fairly straightforward, and it functionally works as desired, but its performance is not yet what I think it could be. The app seems to spend a lot of time in system calls synchronizing, so after some profiling I wonder whether I selected the most appropriate synchronization mechanism(s). A key requirement of the algorithm is that it needs to proceed in stages, so the threads need to sync up at the end of each stage. The main thread prepares the work (very low overhead), passes it to the threads, lets them work on it, then proceeds when all threads are done, rearranges the work (again very low overhead) and repeats the cycle. The machine is dedicated to this task, Garbage Collection is minimized by using per-thread pools of pre-allocated items, and the number of threads can be fixed (no incoming requests or the like, just one thread per CPU core). V1 - ExecutorService My first implementation used an ExecutorService with 8 worker threads. The program creates 8 tasks holding the work and then lets them work on it, roughly like this: // create one thread per CPU executorService = Executors.newFixedThreadPool( 8 ); ... // now process data in cycles while( ...) { // package data into 8 work items ... // create one Callable task per work item ... // submit the Callables to the worker threads executorService.invokeAll( taskList ); } This works well functionally (it does what it should), and for very large work items indeed all 8 CPUs become highly loaded, as much as the processing algorithm would be expected to allow (some work items will finish faster than others, then idle). However, as the work items become smaller (and this is not really under the program's control), the user CPU load shrinks dramatically: blocksize | system | user | cycles/sec 256k 1.8% 85% 1.30 64k 2.5% 77% 5.6 16k 4% 64% 22.5 4096 8% 56% 86 1024 13% 38% 227 256 17% 19% 420 64 19% 17% 948 16 19% 13% 1626 Legend: - block size = size of the work item (= computational steps) - system = system load, as shown in OS X Activity Monitor (red bar) - user = user load, as shown in OS X Activity Monitor (green bar) - cycles/sec = iterations through the main while loop, more is better The primary area of concern here is the high percentage of time spent in the system, which appears to be driven by thread synchronization calls. As expected, for smaller work items, ExecutorService.invokeAll() will require relatively more effort to sync up the threads versus the amount of work being performed in each thread. But since ExecutorService is more generic than it would need to be for this use case (it can queue tasks for threads if there are more tasks than cores), I though maybe there would be a leaner synchronization construct. V2 - CyclicBarrier The next implementation used a CyclicBarrier to sync up the threads before receiving work and after completing it, roughly as follows: main() { // create the barrier barrier = new CyclicBarrier( 8 + 1 ); // create Runable for thread, tell it about the barrier Runnable task = new WorkerThreadRunnable( barrier ); // start the threads for( int i = 0; i < 8; i++ ) { // create one thread per core new Thread( task ).start(); } while( ... ) { // tell threads about the work ... // N threads + this will call await(), then system proceeds barrier.await(); // ... now worker threads work on the work... // wait for worker threads to finish barrier.await(); } } class WorkerThreadRunnable implements Runnable { CyclicBarrier barrier; WorkerThreadRunnable( CyclicBarrier barrier ) { this.barrier = barrier; } public void run() { while( true ) { // wait for work barrier.await(); // do the work ... // wait for everyone else to finish barrier.await(); } } } Again, this works well functionally (it does what it should), and for very large work items indeed all 8 CPUs become highly loaded, as before. However, as the work items become smaller, the load still shrinks dramatically: blocksize | system | user | cycles/sec 256k 1.9% 85% 1.30 64k 2.7% 78% 6.1 16k 5.5% 52% 25 4096 9% 29% 64 1024 11% 15% 117 256 12% 8% 169 64 12% 6.5% 285 16 12% 6% 377 For large work items, synchronization is negligible and the performance is identical to V1. But unexpectedly, the results of the (highly specialized) CyclicBarrier seem MUCH WORSE than those for the (generic) ExecutorService: throughput (cycles/sec) is only about 1/4th of V1. A preliminary conclusion would be that even though this seems to be the advertised ideal use case for CyclicBarrier, it performs much worse than the generic ExecutorService. V3 - Wait/Notify + CyclicBarrier It seemed worth a try to replace the first cyclic barrier await() with a simple wait/notify mechanism: main() { // create the barrier // create Runable for thread, tell it about the barrier // start the threads while( ... ) { // tell threads about the work // for each: workerThreadRunnable.setWorkItem( ... ); // ... now worker threads work on the work... // wait for worker threads to finish barrier.await(); } } class WorkerThreadRunnable implements Runnable { CyclicBarrier barrier; @NotNull volatile private Callable<Integer> workItem; WorkerThreadRunnable( CyclicBarrier barrier ) { this.barrier = barrier; this.workItem = NO_WORK; } final protected void setWorkItem( @NotNull final Callable<Integer> callable ) { synchronized( this ) { workItem = callable; notify(); } } public void run() { while( true ) { // wait for work while( true ) { synchronized( this ) { if( workItem != NO_WORK ) break; try { wait(); } catch( InterruptedException e ) { e.printStackTrace(); } } } // do the work ... // wait for everyone else to finish barrier.await(); } } } Again, this works well functionally (it does what it should). blocksize | system | user | cycles/sec 256k 1.9% 85% 1.30 64k 2.4% 80% 6.3 16k 4.6% 60% 30.1 4096 8.6% 41% 98.5 1024 12% 23% 202 256 14% 11.6% 299 64 14% 10.0% 518 16 14.8% 8.7% 679 The throughput for small work items is still much worse than that of the ExecutorService, but about 2x that of the CyclicBarrier. Eliminating one CyclicBarrier eliminates half of the gap. V4 - Busy wait instead of wait/notify Since this app is the primary one running on the system and the cores idle anyway if they're not busy with a work item, why not try a busy wait for work items in each thread, even if that spins the CPU needlessly. The worker thread code changes as follows: class WorkerThreadRunnable implements Runnable { // as before final protected void setWorkItem( @NotNull final Callable<Integer> callable ) { workItem = callable; } public void run() { while( true ) { // busy-wait for work while( true ) { if( workItem != NO_WORK ) break; } // do the work ... // wait for everyone else to finish barrier.await(); } } } Also works well functionally (it does what it should). blocksize | system | user | cycles/sec 256k 1.9% 85% 1.30 64k 2.2% 81% 6.3 16k 4.2% 62% 33 4096 7.5% 40% 107 1024 10.4% 23% 210 256 12.0% 12.0% 310 64 11.9% 10.2% 550 16 12.2% 8.6% 741 For small work items, this increases throughput by a further 10% over the CyclicBarrier + wait/notify variant, which is not insignificant. But it is still much lower-throughput than V1 with the ExecutorService. V5 - ? So what is the best synchronization mechanism for such a (presumably not uncommon) problem? I am weary of writing my own sync mechanism to completely replace ExecutorService (assuming that it is too generic and there has to be something that can still be taken out to make it more efficient). It is not my area of expertise and I'm concerned that I'd spend a lot of time debugging it (since I'm not even sure my wait/notify and busy wait variants are correct) for uncertain gain. Any advice would be greatly appreciated.

Read the article

CSS layout problem on Firefox with filling space between end of left column and footer

- by Jean

Basically, the left column is supposed to extend to the footer with the continuous red color. However, in Firefox on pages with lots of text, the column does not extend to the footer and leaves a large white gap--see site: http://library.luhs.org/JHSII/about.html I've tried readjusting the heights, creating the sticky footer, and other things I've read about on this site. So I admit that I'm stumped, and what's really odd is that the layout seems to work in IE as there is no white space! I didn't create the site, but I recently inherited it and trying to work through the mess Any help is much appreciated, here's the CSS #html,body{ margin:0; padding:0; border:0; height:100%; } #body{ background:#ffffff; min-width:965px; text-align:center; width: 600px; font: Geneva, Arial, Helvetica, sans-serif; } #.style7{ clear:both; height:1px; overflow:hidden; line-height:1%; font-size:0px; margin-bottom:-1px; } #fullheightcontainer{ margin-left:auto; margin-right:auto; text-align:left; position:relative; width:965px; height:100%; } #wrapper{ min-height:100%; height:100%; background:#660000; background-color: #660000; background-repeat: repeat; } #wrapp\65 r{ height:auto; } # html wrapper{ height:100%; } #outer{ z-index:1; position:relative; margin-left:150px; width:815px; background:#FFFFFF; height:100%; background-color: #FFFFFF; } #left{ width:151px; float:left; display:inline; position:relative; margin-left:-150px; } padding: 20px; border: 0; margin: 0 0 0 240px *>html #left{width:150px;} #container-left{ width:150px; color: #CCCCCC; } * html #left{margin-right:-3px;} #center{ width:800px; float:right; display:inline; margin-left:-1px; } #clearheadercenter{ height:125px; overflow:hidden; } #clearfootercenter{ height:50px; overflow:hidden; } #footer{ z-index:1; position:relative; clear: both; width:965px; height:50px; overflow:hidden; margin-top:-50px; background-color: #660000; } #subfooter1{ background:#FFFFCC; text-align:left; margin-left:150px; height:50px; } #header{ z-index:1; position:absolute; top:0px; width:815px; margin-left:150px; height:100px; overflow:hidden; background-color: #660000; } #subheader1{ background:#FFFFCC; text-align:center; height:70px; } #gfx_bg_middle{ top:0px; position:absolute; height:100%; overflow:hidden; width:815px; margin-left:150px; background:#FFFFFF; } # html #gfx_bg_middle{ display:none; } #floatingnav { margin: 5px 10px 5px 5px; padding: 0px 5px 5px; float: right; font: .75em/1.35em Geneva, Arial, Helvetica, sans-serif; height: 600px; width: 300px; } #floatingnav a { color: #630; } #floatingnav ul { margin-top: -5; } #.floatright { float: right; margin: 0 0 10px 10px; border: 1px solid #666; padding: 2px; } #outer{ word-wrap:break-word; } #table.s1 { border-width: medium; border-spacing: 2px; border-style: none; border-color: rgb(85, 0, 0); border-collapse: collapse; background-color: white; } #table.s1 th { border-width: medium; padding: 2px; border-style: groove; border-color: red; background-color: white; -moz-border-radius: 0px 0px 0px 0px; } #table.s1 td { border-width: medium; padding: 2px; border-style: groove; border-color: #660000; background-color: #FFFFFF; -moz-border-radius: 0px 0px 0px 0px; } #a:link { color: #000066; } #a:visited { color: #000066; } #p.sample { font-family: serif; font-style: normal; font-variant: normal; font-weight: normal; font-size: medium; line-height: 100%; word-spacing: normal; letter-spacing: normal; text-decoration: none; text-transform: none; text-align: left; text-indent: 0ex; }

Read the article

deadlocks in the innodb status

- by shantanuo

Mysql sever has suddenly become very slow. There are no queries in the slow query log but the innodb status shows something like the following. Does it mean that it is due to innodb deadlock? if Yes, what is the way out? *************************** 1. row *************************** Status: ===================================== 100315 12:55:29 INNODB MONITOR OUTPUT ===================================== Per second averages calculated from the last 5 seconds ---------- SEMAPHORES ---------- OS WAIT ARRAY INFO: reservation count 187532, signal count 188120 Mutex spin waits 0, rounds 61908654, OS waits 33052 RW-shared spins 89241, OS waits 41948; RW-excl spins 5857, OS waits 1557 ------------------------ LATEST DETECTED DEADLOCK ------------------------ 100315 12:43:02 *** (1) TRANSACTION: TRANSACTION 0 56996536, ACTIVE 0 sec, process no 5000, OS thread id 3031395216 starting index read mysql tables in use 1, locked 1 LOCK WAIT 6 lock struct(s), heap size 1024, undo log entries 6 MySQL thread id 994, query id 7699751 localhost application Searching rows for update UPDATE QUERY *** (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 0 page no 4073 n bits 296 index `PRIMARY` of table `dbII/tbl_ticket_block_master` trx id 0 56996536 lock_mode X locks r ec but not gap waiting Record lock, heap no 141 PHYSICAL RECORD: n_fields 23; compact format; info bits 0 0: len 7; hex 33353837393936; asc 3587996;; 1: len 4; hex 800001f4; asc ;; 2: len 1; hex 47; asc G;; 3: len 2; hex 6f6b; asc ok;; 4: le n 6; hex 0000035957fe; asc YW ;; 5: len 7; hex 000000401737c0; asc @ 7 ;; 6: SQL NULL; 7: SQL NULL; 8: SQL NULL; 9: len 3; hex 8fb46e; asc n;; 10: SQL NULL; 11: len 1; hex 30; asc 0;; 12: len 0; hex ; asc ;; 13: SQL NULL; 14: len 1; hex 33; asc 3;; 15: len 4; hex 4b9ceebe ; asc K ;; 16: len 1; hex 30; asc 0;; 17: len 4; hex 80006ae8; asc j ;; 18: len 0; hex ; asc ;; 19: len 0; hex ; asc ;; 20: len 0; hex ; asc ;; 21: len 0; hex ; asc ;; 22: len 0; hex ; asc ;; *** (2) TRANSACTION: TRANSACTION 0 56996527, ACTIVE 0 sec, process no 5000, OS thread id 2961476496 fetching rows, thread declared inside InnoDB 237 mysql tables in use 3, locked 3 121 lock struct(s), heap size 11584, undo log entries 16 MySQL thread id 995, query id 7699729 localhost application Searching rows for update UPDATE QUERY *** (2) HOLDS THE LOCK(S): RECORD LOCKS space id 0 page no 4073 n bits 296 index `PRIMARY` of table `DBII/tbl_ticket_block_master` trx id 0 56996527 lock_mode X Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0 0: len 8; hex 73757072656d756d; asc supremum;; Record lock, heap no 2 PHYSICAL RECORD: n_fields 23; compact format; info bits 0 0: len 7; hex 33353837343631; asc 3587461;; 1: len 4; hex 800001f4; asc ;; 2: len 1; hex 47; asc G;; 3: len 6; hex 497373756564; asc Is sued;; 4: len 6; hex 000003425295; asc BR ;; 5: len 7; hex 8000000464012c; asc d ,;; 6: SQL NULL; 7: len 4; hex 80000058; asc X;; 8: len 1; hex 43; asc C;; 9: len 3; hex 8fb465; asc e;; 10: len 3; hex 8fb46d; asc m;; 11: len 1; hex 30; asc 0;; 12: len 0; hex ; asc ; ; 13: SQL NULL; 14: len 1; hex 33; asc 3;; 15: len 4; hex 4b9b33a2; asc K 3 ;; 16: len 3; hex 756d67; asc umg;; 17: len 4; hex 80006744; asc gD;; 18: len 0; hex ; asc ;; 19: len 0; hex ; asc ;; 20: len 0; hex ; asc ;; 21: len 0; hex ; asc ;; 22: len 0; hex ; asc ;;

Read the article

Advantage Database Server: slow stored procedure performance.

- by ie

I have a question about a performance of stored procedures in the ADS. I created a simple database with the following structure: CREATE TABLE MainTable ( Id INTEGER PRIMARY KEY, Name VARCHAR(50), Value INTEGER ); CREATE UNIQUE INDEX MainTableName_UIX ON MainTable ( Name ); CREATE TABLE SubTable ( Id INTEGER PRIMARY KEY, MainId INTEGER, Name VARCHAR(50), Value INTEGER ); CREATE INDEX SubTableMainId_UIX ON SubTable ( MainId ); CREATE UNIQUE INDEX SubTableName_UIX ON SubTable ( Name ); CREATE PROCEDURE CreateItems ( MainName VARCHAR ( 20 ), SubName VARCHAR ( 20 ), MainValue INTEGER, SubValue INTEGER, MainId INTEGER OUTPUT, SubId INTEGER OUTPUT ) BEGIN DECLARE @MainName VARCHAR ( 20 ); DECLARE @SubName VARCHAR ( 20 ); DECLARE @MainValue INTEGER; DECLARE @SubValue INTEGER; DECLARE @MainId INTEGER; DECLARE @SubId INTEGER; @MainName = (SELECT MainName FROM __input); @SubName = (SELECT SubName FROM __input); @MainValue = (SELECT MainValue FROM __input); @SubValue = (SELECT SubValue FROM __input); @MainId = (SELECT MAX(Id)+1 FROM MainTable); @SubId = (SELECT MAX(Id)+1 FROM SubTable ); INSERT INTO MainTable (Id, Name, Value) VALUES (@MainId, @MainName, @MainValue); INSERT INTO SubTable (Id, Name, MainId, Value) VALUES (@SubId, @SubName, @MainId, @SubValue); INSERT INTO __output SELECT @MainId, @SubId FROM system.iota; END; CREATE PROCEDURE UpdateItems ( MainName VARCHAR ( 20 ), MainValue INTEGER, SubValue INTEGER ) BEGIN DECLARE @MainName VARCHAR ( 20 ); DECLARE @MainValue INTEGER; DECLARE @SubValue INTEGER; DECLARE @MainId INTEGER; @MainName = (SELECT MainName FROM __input); @MainValue = (SELECT MainValue FROM __input); @SubValue = (SELECT SubValue FROM __input); @MainId = (SELECT TOP 1 Id FROM MainTable WHERE Name = @MainName); UPDATE MainTable SET Value = @MainValue WHERE Id = @MainId; UPDATE SubTable SET Value = @SubValue WHERE MainId = @MainId; END; CREATE PROCEDURE SelectItems ( MainName VARCHAR ( 20 ), CalculatedValue INTEGER OUTPUT ) BEGIN DECLARE @MainName VARCHAR ( 20 ); @MainName = (SELECT MainName FROM __input); INSERT INTO __output SELECT m.Value * s.Value FROM MainTable m INNER JOIN SubTable s ON m.Id = s.MainId WHERE m.Name = @MainName; END; CREATE PROCEDURE DeleteItems ( MainName VARCHAR ( 20 ) ) BEGIN DECLARE @MainName VARCHAR ( 20 ); DECLARE @MainId INTEGER; @MainName = (SELECT MainName FROM __input); @MainId = (SELECT TOP 1 Id FROM MainTable WHERE Name = @MainName); DELETE FROM SubTable WHERE MainId = @MainId; DELETE FROM MainTable WHERE Id = @MainId; END; Actually, the problem I had - even so light stored procedures work very-very slow (about 50-150 ms) relatively to plain queries (0-5ms). To test the performance, I created a simple test (in F# using ADS ADO.NET provider): open System; open System.Data; open System.Diagnostics; open Advantage.Data.Provider; let mainName = "main name #"; let subName = "sub name #"; // INSERT let cmdTextScriptInsert = " DECLARE @MainId INTEGER; DECLARE @SubId INTEGER; @MainId = (SELECT MAX(Id)+1 FROM MainTable); @SubId = (SELECT MAX(Id)+1 FROM SubTable ); INSERT INTO MainTable (Id, Name, Value) VALUES (@MainId, :MainName, :MainValue); INSERT INTO SubTable (Id, Name, MainId, Value) VALUES (@SubId, :SubName, @MainId, :SubValue); SELECT @MainId, @SubId FROM system.iota;"; let cmdTextProcedureInsert = "CreateItems"; // UPDATE let cmdTextScriptUpdate = " DECLARE @MainId INTEGER; @MainId = (SELECT TOP 1 Id FROM MainTable WHERE Name = :MainName); UPDATE MainTable SET Value = :MainValue WHERE Id = @MainId; UPDATE SubTable SET Value = :SubValue WHERE MainId = @MainId;"; let cmdTextProcedureUpdate = "UpdateItems"; // SELECT let cmdTextScriptSelect = " SELECT m.Value * s.Value FROM MainTable m INNER JOIN SubTable s ON m.Id = s.MainId WHERE m.Name = :MainName;"; let cmdTextProcedureSelect = "SelectItems"; // DELETE let cmdTextScriptDelete = " DECLARE @MainId INTEGER; @MainId = (SELECT TOP 1 Id FROM MainTable WHERE Name = :MainName); DELETE FROM SubTable WHERE MainId = @MainId; DELETE FROM MainTable WHERE Id = @MainId;"; let cmdTextProcedureDelete = "DeleteItems"; let cnnStr = @"data source=D:\DB\test.add; ServerType=local; user id=adssys; password=***;"; let cnn = new AdsConnection(cnnStr); try cnn.Open(); let cmd = cnn.CreateCommand(); let parametrize ix prms = cmd.Parameters.Clear(); let addParam = function | "MainName" -> cmd.Parameters.Add(":MainName" , mainName + ix.ToString()) |> ignore; | "SubName" -> cmd.Parameters.Add(":SubName" , subName + ix.ToString() ) |> ignore; | "MainValue" -> cmd.Parameters.Add(":MainValue", ix * 3 ) |> ignore; | "SubValue" -> cmd.Parameters.Add(":SubValue" , ix * 7 ) |> ignore; | _ -> () prms |> List.iter addParam; let runTest testData = let (cmdType, cmdName, cmdText, cmdParams) = testData; let toPrefix cmdType cmdName = let prefix = match cmdType with | CommandType.StoredProcedure -> "Procedure-" | CommandType.Text -> "Script -" | _ -> "Unknown -" in prefix + cmdName; let stopWatch = new Stopwatch(); let runStep ix prms = parametrize ix prms; stopWatch.Start(); cmd.ExecuteNonQuery() |> ignore; stopWatch.Stop(); cmd.CommandText <- cmdText; cmd.CommandType <- cmdType; let startId = 1500; let count = 10; for id in startId .. startId+count do runStep id cmdParams; let elapsed = stopWatch.Elapsed; Console.WriteLine("Test '{0}' - total: {1}; per call: {2}ms", toPrefix cmdType cmdName, elapsed, Convert.ToInt32(elapsed.TotalMilliseconds)/count); let lst = [ (CommandType.Text, "Insert", cmdTextScriptInsert, ["MainName"; "SubName"; "MainValue"; "SubValue"]); (CommandType.Text, "Update", cmdTextScriptUpdate, ["MainName"; "MainValue"; "SubValue"]); (CommandType.Text, "Select", cmdTextScriptSelect, ["MainName"]); (CommandType.Text, "Delete", cmdTextScriptDelete, ["MainName"]) (CommandType.StoredProcedure, "Insert", cmdTextProcedureInsert, ["MainName"; "SubName"; "MainValue"; "SubValue"]); (CommandType.StoredProcedure, "Update", cmdTextProcedureUpdate, ["MainName"; "MainValue"; "SubValue"]); (CommandType.StoredProcedure, "Select", cmdTextProcedureSelect, ["MainName"]); (CommandType.StoredProcedure, "Delete", cmdTextProcedureDelete, ["MainName"])]; lst |> List.iter runTest; finally cnn.Close(); And I'm getting the following results: Test 'Script -Insert' - total: 00:00:00.0292841; per call: 2ms Test 'Script -Update' - total: 00:00:00.0056296; per call: 0ms Test 'Script -Select' - total: 00:00:00.0051738; per call: 0ms Test 'Script -Delete' - total: 00:00:00.0059258; per call: 0ms Test 'Procedure-Insert' - total: 00:00:01.2567146; per call: 125ms Test 'Procedure-Update' - total: 00:00:00.7442440; per call: 74ms Test 'Procedure-Select' - total: 00:00:00.5120446; per call: 51ms Test 'Procedure-Delete' - total: 00:00:01.0619165; per call: 106ms The situation with the remote server is much better, but still a great gap between plaqin queries and stored procedures: Test 'Script -Insert' - total: 00:00:00.0709299; per call: 7ms Test 'Script -Update' - total: 00:00:00.0161777; per call: 1ms Test 'Script -Select' - total: 00:00:00.0258113; per call: 2ms Test 'Script -Delete' - total: 00:00:00.0166242; per call: 1ms Test 'Procedure-Insert' - total: 00:00:00.5116138; per call: 51ms Test 'Procedure-Update' - total: 00:00:00.3802251; per call: 38ms Test 'Procedure-Select' - total: 00:00:00.1241245; per call: 12ms Test 'Procedure-Delete' - total: 00:00:00.4336334; per call: 43ms Is it any chance to improve the SP performance? Please advice. ADO.NET driver version - 9.10.2.9 Server version - 9.10.0.9 (ANSI - GERMAN, OEM - GERMAN) Thanks!

Read the article

SQLite, python, unicode, and non-utf data

- by Nathan Spears

I started by trying to store strings in sqlite using python, and got the message: sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. Ok, I switched to Unicode strings. Then I started getting the message: sqlite3.OperationalError: Could not decode to UTF-8 column 'tag_artist' with text 'Sigur Rós' when trying to retrieve data from the db. More research and I started encoding it in utf8, but then 'Sigur Rós' starts looking like 'Sigur RÃ³s' note: My console was set to display in 'latin_1' as @John Machin pointed out. What gives? After reading this, describing exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all. I didn't know much about unicode and utf before I started this process. I've learned quite a bit in the last couple hours, but I'm still ignorant of whether there is a way to correctly convert 'ó' from latin-1 to utf-8 and not mangle it. If there isn't, why would sqlite 'highly recommend' I switch my application to unicode strings? I'm going to update this question with a summary and some example code of everything I've learned in the last 24 hours so that someone in my shoes can have an easy(er) guide. If the information I post is wrong or misleading in any way please tell me and I'll update, or one of you senior guys can update. Summary of answers Let me first state the goal as I understand it. The goal in processing various encodings, if you are trying to convert between them, is to understand what your source encoding is, then convert it to unicode using that source encoding, then convert it to your desired encoding. Unicode is a base and encodings are mappings of subsets of that base. utf_8 has room for every character in unicode, but because they aren't in the same place as, for instance, latin_1, a string encoded in utf_8 and sent to a latin_1 console will not look the way you expect. In python the process of getting to unicode and into another encoding looks like: str.decode('source_encoding').encode('desired_encoding') or if the str is already in unicode str.encode('desired_encoding') For sqlite I didn't actually want to encode it again, I wanted to decode it and leave it in unicode format. Here are four things you might need to be aware of as you try to work with unicode and encodings in python. The encoding of the string you want to work with, and the encoding you want to get it to. The system encoding. The console encoding. The encoding of the source file Elaboration: (1) When you read a string from a source, it must have some encoding, like latin_1 or utf_8. In my case, I'm getting strings from filenames, so unfortunately, I could be getting any kind of encoding. Windows XP uses UCS-2 (a Unicode system) as its native string type, which seems like cheating to me. Fortunately for me, the characters in most filenames are not going to be made up of more than one source encoding type, and I think all of mine were either completely latin_1, completely utf_8, or just plain ascii (which is a subset of both of those). So I just read them and decoded them as if they were still in latin_1 or utf_8. It's possible, though, that you could have latin_1 and utf_8 and whatever other characters mixed together in a filename on Windows. Sometimes those characters can show up as boxes, other times they just look mangled, and other times they look correct (accented characters and whatnot). Moving on. (2) Python has a default system encoding that gets set when python starts and can't be changed during runtime. See here for details. Dirty summary ... well here's the file I added: \# sitecustomize.py \# this file can be anywhere in your Python path, \# but it usually goes in ${pythondir}/lib/site-packages/ import sys sys.setdefaultencoding('utf_8') This system encoding is the one that gets used when you use the unicode("str") function without any other encoding parameters. To say that another way, python tries to decode "str" to unicode based on the default system encoding. (3) If you're using IDLE or the command-line python, I think that your console will display according to the default system encoding. I am using pydev with eclipse for some reason, so I had to go into my project settings, edit the launch configuration properties of my test script, go to the Common tab, and change the console from latin-1 to utf-8 so that I could visually confirm what I was doing was working. (4) If you want to have some test strings, eg test_str = "ó" in your source code, then you will have to tell python what kind of encoding you are using in that file. (FYI: when I mistyped an encoding I had to ctrl-Z because my file became unreadable.) This is easily accomplished by putting a line like so at the top of your source code file: # -*- coding: utf_8 -*- If you don't have this information, python attempts to parse your code as ascii by default, and so: SyntaxError: Non-ASCII character '\xf3' in file _redacted_ on line 81, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details Once your program is working correctly, or, if you aren't using python's console or any other console to look at output, then you will probably really only care about #1 on the list. System default and console encoding are not that important unless you need to look at output and/or you are using the builtin unicode() function (without any encoding parameters) instead of the string.decode() function. I wrote a demo function I will paste into the bottom of this gigantic mess that I hope correctly demonstrates the items in my list. Here is some of the output when I run the character 'ó' through the demo function, showing how various methods react to the character as input. My system encoding and console output are both set to utf_8 for this run: '?' = original char <type 'str'> repr(char)='\xf3' '?' = unicode(char) ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data 'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3' '?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data Now I will change the system and console encoding to latin_1, and I get this output for the same input: 'ó' = original char <type 'str'> repr(char)='\xf3' 'ó' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3' 'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3' '?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data Notice that the 'original' character displays correctly and the builtin unicode() function works now. Now I change my console output back to utf_8. '?' = original char <type 'str'> repr(char)='\xf3' '?' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3' '?' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3' '?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data Here everything still works the same as last time but the console can't display the output correctly. Etc. The function below also displays more information that this and hopefully would help someone figure out where the gap in their understanding is. I know all this information is in other places and more thoroughly dealt with there, but I hope that this would be a good kickoff point for someone trying to get coding with python and/or sqlite. Ideas are great but sometimes source code can save you a day or two of trying to figure out what functions do what. Disclaimers: I'm no encoding expert, I put this together to help my own understanding. I kept building on it when I should have probably started passing functions as arguments to avoid so much redundant code, so if I can I'll make it more concise. Also, utf_8 and latin_1 are by no means the only encoding schemes, they are just the two I was playing around with because I think they handle everything I need. Add your own encoding schemes to the demo function and test your own input. One more thing: there are apparently crazy application developers making life difficult in Windows. #!/usr/bin/env python # -*- coding: utf_8 -*- import os import sys def encodingDemo(str): validStrings = () try: print "str =",str,"{0} repr(str) = {1}".format(type(str), repr(str)) validStrings += ((str,""),) except UnicodeEncodeError as ude: print "Couldn't print the str itself because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t", print ude try: x = unicode(str) print "unicode(str) = ",x validStrings+= ((x, " decoded into unicode by the default system encoding"),) except UnicodeDecodeError as ude: print "ERROR. unicode(str) couldn't decode the string because the system encoding is set to an encoding that doesn't understand some character in the string." print "\tThe system encoding is set to {0}. See error:\n\t".format(sys.getdefaultencoding()), print ude except UnicodeEncodeError as uee: print "ERROR. Couldn't print the unicode(str) because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t", print uee try: x = str.decode('latin_1') print "str.decode('latin_1') =",x validStrings+= ((x, " decoded with latin_1 into unicode"),) try: print "str.decode('latin_1').encode('utf_8') =",str.decode('latin_1').encode('utf_8') validStrings+= ((x, " decoded with latin_1 into unicode and encoded into utf_8"),) except UnicodeDecodeError as ude: print "The string was decoded into unicode using the latin_1 encoding, but couldn't be encoded into utf_8. See error:\n\t", print ude except UnicodeDecodeError as ude: print "Something didn't work, probably because the string wasn't latin_1 encoded. See error:\n\t", print ude except UnicodeEncodeError as uee: print "ERROR. Couldn't print the str.decode('latin_1') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t", print uee try: x = str.decode('utf_8') print "str.decode('utf_8') =",x validStrings+= ((x, " decoded with utf_8 into unicode"),) try: print "str.decode('utf_8').encode('latin_1') =",str.decode('utf_8').encode('latin_1') except UnicodeDecodeError as ude: print "str.decode('utf_8').encode('latin_1') didn't work. The string was decoded into unicode using the utf_8 encoding, but couldn't be encoded into latin_1. See error:\n\t", validStrings+= ((x, " decoded with utf_8 into unicode and encoded into latin_1"),) print ude except UnicodeDecodeError as ude: print "str.decode('utf_8') didn't work, probably because the string wasn't utf_8 encoded. See error:\n\t", print ude except UnicodeEncodeError as uee: print "ERROR. Couldn't print the str.decode('utf_8') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",uee print print "Printing information about each character in the original string." for char in str: try: print "\t'" + char + "' = original char {0} repr(char)={1}".format(type(char), repr(char)) except UnicodeDecodeError as ude: print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), ude) except UnicodeEncodeError as uee: print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), uee) print uee try: x = unicode(char) print "\t'" + x + "' = unicode(char) {1} repr(unicode(char))={2}".format(x, type(x), repr(x)) except UnicodeDecodeError as ude: print "\t'?' = unicode(char) ERROR: {0}".format(ude) except UnicodeEncodeError as uee: print "\t'?' = unicode(char) {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee) try: x = char.decode('latin_1') print "\t'" + x + "' = char.decode('latin_1') {1} repr(char.decode('latin_1'))={2}".format(x, type(x), repr(x)) except UnicodeDecodeError as ude: print "\t'?' = char.decode('latin_1') ERROR: {0}".format(ude) except UnicodeEncodeError as uee: print "\t'?' = char.decode('latin_1') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee) try: x = char.decode('utf_8') print "\t'" + x + "' = char.decode('utf_8') {1} repr(char.decode('utf_8'))={2}".format(x, type(x), repr(x)) except UnicodeDecodeError as ude: print "\t'?' = char.decode('utf_8') ERROR: {0}".format(ude) except UnicodeEncodeError as uee: print "\t'?' = char.decode('utf_8') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee) print x = 'ó' encodingDemo(x) Much thanks for the answers below and especially to @John Machin for answering so thoroughly.

Read the article

SQLite, python, unicode, and non-utf data

- by Nathan Spears

I started by trying to store strings in sqlite using python, and got the message: sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. Ok, I switched to Unicode strings. Then I started getting the message: sqlite3.OperationalError: Could not decode to UTF-8 column 'tag_artist' with text 'Sigur Rós' when trying to retrieve data from the db. More research and I started encoding it in utf8, but then 'Sigur Rós' starts looking like 'Sigur RÃ³s' note: My console was set to display in 'latin_1' as @John Machin pointed out. What gives? After reading this, describing exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all. I didn't know much about unicode and utf before I started this process. I've learned quite a bit in the last couple hours, but I'm still ignorant of whether there is a way to correctly convert 'ó' from latin-1 to utf-8 and not mangle it. If there isn't, why would sqlite 'highly recommend' I switch my application to unicode strings? I'm going to update this question with a summary and some example code of everything I've learned in the last 24 hours so that someone in my shoes can have an easy(er) guide. If the information I post is wrong or misleading in any way please tell me and I'll update, or one of you senior guys can update. Summary of answers Let me first state the goal as I understand it. The goal in processing various encodings, if you are trying to convert between them, is to understand what your source encoding is, then convert it to unicode using that source encoding, then convert it to your desired encoding. Unicode is a base and encodings are mappings of subsets of that base. utf_8 has room for every character in unicode, but because they aren't in the same place as, for instance, latin_1, a string encoded in utf_8 and sent to a latin_1 console will not look the way you expect. In python the process of getting to unicode and into another encoding looks like: str.decode('source_encoding').encode('desired_encoding') or if the str is already in unicode str.encode('desired_encoding') For sqlite I didn't actually want to encode it again, I wanted to decode it and leave it in unicode format. Here are four things you might need to be aware of as you try to work with unicode and encodings in python. The encoding of the string you want to work with, and the encoding you want to get it to. The system encoding. The console encoding. The encoding of the source file Elaboration: (1) When you read a string from a source, it must have some encoding, like latin_1 or utf_8. In my case, I'm getting strings from filenames, so unfortunately, I could be getting any kind of encoding. Windows XP uses UCS-2 (a Unicode system) as its native string type, which seems like cheating to me. Fortunately for me, the characters in most filenames are not going to be made up of more than one source encoding type, and I think all of mine were either completely latin_1, completely utf_8, or just plain ascii (which is a subset of both of those). So I just read them and decoded them as if they were still in latin_1 or utf_8. It's possible, though, that you could have latin_1 and utf_8 and whatever other characters mixed together in a filename on Windows. Sometimes those characters can show up as boxes, other times they just look mangled, and other times they look correct (accented characters and whatnot). Moving on. (2) Python has a default system encoding that gets set when python starts and can't be changed during runtime. See here for details. Dirty summary ... well here's the file I added: \# sitecustomize.py \# this file can be anywhere in your Python path, \# but it usually goes in ${pythondir}/lib/site-packages/ import sys sys.setdefaultencoding('utf_8') This system encoding is the one that gets used when you use the unicode("str") function without any other encoding parameters. To say that another way, python tries to decode "str" to unicode based on the default system encoding. (3) If you're using IDLE or the command-line python, I think that your console will display according to the default system encoding. I am using pydev with eclipse for some reason, so I had to go into my project settings, edit the launch configuration properties of my test script, go to the Common tab, and change the console from latin-1 to utf-8 so that I could visually confirm what I was doing was working. (4) If you want to have some test strings, eg test_str = "ó" in your source code, then you will have to tell python what kind of encoding you are using in that file. (FYI: when I mistyped an encoding I had to ctrl-Z because my file became unreadable.) This is easily accomplished by putting a line like so at the top of your source code file: # -*- coding: utf_8 -*- If you don't have this information, python attempts to parse your code as ascii by default, and so: SyntaxError: Non-ASCII character '\xf3' in file _redacted_ on line 81, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details Once your program is working correctly, or, if you aren't using python's console or any other console to look at output, then you will probably really only care about #1 on the list. System default and console encoding are not that important unless you need to look at output and/or you are using the builtin unicode() function (without any encoding parameters) instead of the string.decode() function. I wrote a demo function I will paste into the bottom of this gigantic mess that I hope correctly demonstrates the items in my list. Here is some of the output when I run the character 'ó' through the demo function, showing how various methods react to the character as input. My system encoding and console output are both set to utf_8 for this run: '?' = original char <type 'str'> repr(char)='\xf3' '?' = unicode(char) ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data 'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3' '?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data Now I will change the system and console encoding to latin_1, and I get this output for the same input: 'ó' = original char <type 'str'> repr(char)='\xf3' 'ó' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3' 'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3' '?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data Notice that the 'original' character displays correctly and the builtin unicode() function works now. Now I change my console output back to utf_8. '?' = original char <type 'str'> repr(char)='\xf3' '?' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3' '?' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3' '?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data Here everything still works the same as last time but the console can't display the output correctly. Etc. The function below also displays more information that this and hopefully would help someone figure out where the gap in their understanding is. I know all this information is in other places and more thoroughly dealt with there, but I hope that this would be a good kickoff point for someone trying to get coding with python and/or sqlite. Ideas are great but sometimes source code can save you a day or two of trying to figure out what functions do what. Disclaimers: I'm no encoding expert, I put this together to help my own understanding. I kept building on it when I should have probably started passing functions as arguments to avoid so much redundant code, so if I can I'll make it more concise. Also, utf_8 and latin_1 are by no means the only encoding schemes, they are just the two I was playing around with because I think they handle everything I need. Add your own encoding schemes to the demo function and test your own input. One more thing: there are apparently crazy application developers making life difficult in Windows. #!/usr/bin/env python # -*- coding: utf_8 -*- import os import sys def encodingDemo(str): validStrings = () try: print "str =",str,"{0} repr(str) = {1}".format(type(str), repr(str)) validStrings += ((str,""),) except UnicodeEncodeError as ude: print "Couldn't print the str itself because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t", print ude try: x = unicode(str) print "unicode(str) = ",x validStrings+= ((x, " decoded into unicode by the default system encoding"),) except UnicodeDecodeError as ude: print "ERROR. unicode(str) couldn't decode the string because the system encoding is set to an encoding that doesn't understand some character in the string." print "\tThe system encoding is set to {0}. See error:\n\t".format(sys.getdefaultencoding()), print ude except UnicodeEncodeError as uee: print "ERROR. Couldn't print the unicode(str) because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t", print uee try: x = str.decode('latin_1') print "str.decode('latin_1') =",x validStrings+= ((x, " decoded with latin_1 into unicode"),) try: print "str.decode('latin_1').encode('utf_8') =",str.decode('latin_1').encode('utf_8') validStrings+= ((x, " decoded with latin_1 into unicode and encoded into utf_8"),) except UnicodeDecodeError as ude: print "The string was decoded into unicode using the latin_1 encoding, but couldn't be encoded into utf_8. See error:\n\t", print ude except UnicodeDecodeError as ude: print "Something didn't work, probably because the string wasn't latin_1 encoded. See error:\n\t", print ude except UnicodeEncodeError as uee: print "ERROR. Couldn't print the str.decode('latin_1') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t", print uee try: x = str.decode('utf_8') print "str.decode('utf_8') =",x validStrings+= ((x, " decoded with utf_8 into unicode"),) try: print "str.decode('utf_8').encode('latin_1') =",str.decode('utf_8').encode('latin_1') except UnicodeDecodeError as ude: print "str.decode('utf_8').encode('latin_1') didn't work. The string was decoded into unicode using the utf_8 encoding, but couldn't be encoded into latin_1. See error:\n\t", validStrings+= ((x, " decoded with utf_8 into unicode and encoded into latin_1"),) print ude except UnicodeDecodeError as ude: print "str.decode('utf_8') didn't work, probably because the string wasn't utf_8 encoded. See error:\n\t", print ude except UnicodeEncodeError as uee: print "ERROR. Couldn't print the str.decode('utf_8') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",uee print print "Printing information about each character in the original string." for char in str: try: print "\t'" + char + "' = original char {0} repr(char)={1}".format(type(char), repr(char)) except UnicodeDecodeError as ude: print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), ude) except UnicodeEncodeError as uee: print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), uee) print uee try: x = unicode(char) print "\t'" + x + "' = unicode(char) {1} repr(unicode(char))={2}".format(x, type(x), repr(x)) except UnicodeDecodeError as ude: print "\t'?' = unicode(char) ERROR: {0}".format(ude) except UnicodeEncodeError as uee: print "\t'?' = unicode(char) {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee) try: x = char.decode('latin_1') print "\t'" + x + "' = char.decode('latin_1') {1} repr(char.decode('latin_1'))={2}".format(x, type(x), repr(x)) except UnicodeDecodeError as ude: print "\t'?' = char.decode('latin_1') ERROR: {0}".format(ude) except UnicodeEncodeError as uee: print "\t'?' = char.decode('latin_1') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee) try: x = char.decode('utf_8') print "\t'" + x + "' = char.decode('utf_8') {1} repr(char.decode('utf_8'))={2}".format(x, type(x), repr(x)) except UnicodeDecodeError as ude: print "\t'?' = char.decode('utf_8') ERROR: {0}".format(ude) except UnicodeEncodeError as uee: print "\t'?' = char.decode('utf_8') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee) print x = 'ó' encodingDemo(x) Much thanks for the answers below and especially to @John Machin for answering so thoroughly.

Developer IT