Search Results

Search found 9137 results on 366 pages for 'worker thread'.

Page 86/366 | < Previous Page | 82 83 84 85 86 87 88 89 90 91 92 93 | Next Page >

PTLQueue : a scalable bounded-capacity MPMC queue

- by Dave

Title: Fast concurrent MPMC queue -- I've used the following concurrent queue algorithm enough that it warrants a blog entry. I'll sketch out the design of a fast and scalable multiple-producer multiple-consumer (MPSC) concurrent queue called PTLQueue. The queue has bounded capacity and is implemented via a circular array. Bounded capacity can be a useful property if there's a mismatch between producer rates and consumer rates where an unbounded queue might otherwise result in excessive memory consumption by virtue of the container nodes that -- in some queue implementations -- are used to hold values. A bounded-capacity queue can provide flow control between components. Beware, however, that bounded collections can also result in resource deadlock if abused. The put() and take() operators are partial and wait for the collection to become non-full or non-empty, respectively. Put() and take() do not allocate memory, and are not vulnerable to the ABA pathologies. The PTLQueue algorithm can be implemented equally well in C/C++ and Java. Partial operators are often more convenient than total methods. In many use cases if the preconditions aren't met, there's nothing else useful the thread can do, so it may as well wait via a partial method. An exception is in the case of work-stealing queues where a thief might scan a set of queues from which it could potentially steal. Total methods return ASAP with a success-failure indication. (It's tempting to describe a queue or API as blocking or non-blocking instead of partial or total, but non-blocking is already an overloaded concurrency term. Perhaps waiting/non-waiting or patient/impatient might be better terms). It's also trivial to construct partial operators by busy-waiting via total operators, but such constructs may be less efficient than an operator explicitly and intentionally designed to wait. A PTLQueue instance contains an array of slots, where each slot has volatile Turn and MailBox fields. The array has power-of-two length allowing mod/div operations to be replaced by masking. We assume sensible padding and alignment to reduce the impact of false sharing. (On x86 I recommend 128-byte alignment and padding because of the adjacent-sector prefetch facility). Each queue also has PutCursor and TakeCursor cursor variables, each of which should be sequestered as the sole occupant of a cache line or sector. You can opt to use 64-bit integers if concerned about wrap-around aliasing in the cursor variables. Put(null) is considered illegal, but the caller or implementation can easily check for and convert null to a distinguished non-null proxy value if null happens to be a value you'd like to pass. Take() will accordingly convert the proxy value back to null. An advantage of PTLQueue is that you can use atomic fetch-and-increment for the partial methods. We initialize each slot at index I with (Turn=I, MailBox=null). Both cursors are initially 0. All shared variables are considered "volatile" and atomics such as CAS and AtomicFetchAndIncrement are presumed to have bidirectional fence semantics. Finally T is the templated type. I've sketched out a total tryTake() method below that allows the caller to poll the queue. tryPut() has an analogous construction. Zebra stripping : alternating row colors for nice-looking code listings. See also google code "prettify" : https://code.google.com/p/google-code-prettify/ Prettify is a javascript module that yields the HTML/CSS/JS equivalent of pretty-print. -- pre:nth-child(odd) { background-color:#ff0000; } pre:nth-child(even) { background-color:#0000ff; } border-left: 11px solid #ccc; margin: 1.7em 0 1.7em 0.3em; background-color:#BFB; font-size:12px; line-height:65%; " // PTLQueue : Put(v) : // producer : partial method - waits as necessary assert v != null assert Mask = 1 && (Mask & (Mask+1)) == 0 // Document invariants // doorway step // Obtain a sequence number -- ticket // As a practical concern the ticket value is temporally unique // The ticket also identifies and selects a slot auto tkt = AtomicFetchIncrement (&PutCursor, 1) slot * s = &Slots[tkt & Mask] // waiting phase : // wait for slot's generation to match the tkt value assigned to this put() invocation. // The "generation" is implicitly encoded as the upper bits in the cursor // above those used to specify the index : tkt div (Mask+1) // The generation serves as an epoch number to identify a cohort of threads // accessing disjoint slots while s-Turn != tkt : Pause assert s-MailBox == null s-MailBox = v // deposit and pass message Take() : // consumer : partial method - waits as necessary auto tkt = AtomicFetchIncrement (&TakeCursor,1) slot * s = &Slots[tkt & Mask] // 2-stage waiting : // First wait for turn for our generation // Acquire exclusive "take" access to slot's MailBox field // Then wait for the slot to become occupied while s-Turn != tkt : Pause // Concurrency in this section of code is now reduced to just 1 producer thread // vs 1 consumer thread. // For a given queue and slot, there will be most one Take() operation running // in this section. // Consumer waits for producer to arrive and make slot non-empty // Extract message; clear mailbox; advance Turn indicator // We have an obvious happens-before relation : // Put(m) happens-before corresponding Take() that returns that same "m" for T v = s-MailBox if v != null : s-MailBox = null ST-ST barrier s-Turn = tkt + Mask + 1 // unlock slot to admit next producer and consumer return v Pause tryTake() : // total method - returns ASAP with failure indication for auto tkt = TakeCursor slot * s = &Slots[tkt & Mask] if s-Turn != tkt : return null T v = s-MailBox // presumptive return value if v == null : return null // ratify tkt and v values and commit by advancing cursor if CAS (&TakeCursor, tkt, tkt+1) != tkt : continue s-MailBox = null ST-ST barrier s-Turn = tkt + Mask + 1 return v The basic idea derives from the Partitioned Ticket Lock "PTL" (US20120240126-A1) and the MultiLane Concurrent Bag (US8689237). The latter is essentially a circular ring-buffer where the elements themselves are queues or concurrent collections. You can think of the PTLQueue as a partitioned ticket lock "PTL" augmented to pass values from lock to unlock via the slots. Alternatively, you could conceptualize of PTLQueue as a degenerate MultiLane bag where each slot or "lane" consists of a simple single-word MailBox instead of a general queue. Each lane in PTLQueue also has a private Turn field which acts like the Turn (Grant) variables found in PTL. Turn enforces strict FIFO ordering and restricts concurrency on the slot mailbox field to at most one simultaneous put() and take() operation. PTL uses a single "ticket" variable and per-slot Turn (grant) fields while MultiLane has distinct PutCursor and TakeCursor cursors and abstract per-slot sub-queues. Both PTL and MultiLane advance their cursor and ticket variables with atomic fetch-and-increment. PTLQueue borrows from both PTL and MultiLane and has distinct put and take cursors and per-slot Turn fields. Instead of a per-slot queues, PTLQueue uses a simple single-word MailBox field. PutCursor and TakeCursor act like a pair of ticket locks, conferring "put" and "take" access to a given slot. PutCursor, for instance, assigns an incoming put() request to a slot and serves as a PTL "Ticket" to acquire "put" permission to that slot's MailBox field. To better explain the operation of PTLQueue we deconstruct the operation of put() and take() as follows. Put() first increments PutCursor obtaining a new unique ticket. That ticket value also identifies a slot. Put() next waits for that slot's Turn field to match that ticket value. This is tantamount to using a PTL to acquire "put" permission on the slot's MailBox field. Finally, having obtained exclusive "put" permission on the slot, put() stores the message value into the slot's MailBox. Take() similarly advances TakeCursor, identifying a slot, and then acquires and secures "take" permission on a slot by waiting for Turn. Take() then waits for the slot's MailBox to become non-empty, extracts the message, and clears MailBox. Finally, take() advances the slot's Turn field, which releases both "put" and "take" access to the slot's MailBox. Note the asymmetry : put() acquires "put" access to the slot, but take() releases that lock. At any given time, for a given slot in a PTLQueue, at most one thread has "put" access and at most one thread has "take" access. This restricts concurrency from general MPMC to 1-vs-1. We have 2 ticket locks -- one for put() and one for take() -- each with its own "ticket" variable in the form of the corresponding cursor, but they share a single "Grant" egress variable in the form of the slot's Turn variable. Advancing the PutCursor, for instance, serves two purposes. First, we obtain a unique ticket which identifies a slot. Second, incrementing the cursor is the doorway protocol step to acquire the per-slot mutual exclusion "put" lock. The cursors and operations to increment those cursors serve double-duty : slot-selection and ticket assignment for locking the slot's MailBox field. At any given time a slot MailBox field can be in one of the following states: empty with no pending operations -- neutral state; empty with one or more waiting take() operations pending -- deficit; occupied with no pending operations; occupied with one or more waiting put() operations -- surplus; empty with a pending put() or pending put() and take() operations -- transitional; or occupied with a pending take() or pending put() and take() operations -- transitional. The partial put() and take() operators can be implemented with an atomic fetch-and-increment operation, which may confer a performance advantage over a CAS-based loop. In addition we have independent PutCursor and TakeCursor cursors. Critically, a put() operation modifies PutCursor but does not access the TakeCursor and a take() operation modifies the TakeCursor cursor but does not access the PutCursor. This acts to reduce coherence traffic relative to some other queue designs. It's worth noting that slow threads or obstruction in one slot (or "lane") does not impede or obstruct operations in other slots -- this gives us some degree of obstruction isolation. PTLQueue is not lock-free, however. The implementation above is expressed with polite busy-waiting (Pause) but it's trivial to implement per-slot parking and unparking to deschedule waiting threads. It's also easy to convert the queue to a more general deque by replacing the PutCursor and TakeCursor cursors with Left/Front and Right/Back cursors that can move either direction. Specifically, to push and pop from the "left" side of the deque we would decrement and increment the Left cursor, respectively, and to push and pop from the "right" side of the deque we would increment and decrement the Right cursor, respectively. We used a variation of PTLQueue for message passing in our recent OPODIS 2013 paper. ul { list-style:none; padding-left:0; padding:0; margin:0; margin-left:0; } ul#myTagID { padding: 0px; margin: 0px; list-style:none; margin-left:0;} -- -- There's quite a bit of related literature in this area. I'll call out a few relevant references: Wilson's NYU Courant Institute UltraComputer dissertation from 1988 is classic and the canonical starting point : Operating System Data Structures for Shared-Memory MIMD Machines with Fetch-and-Add. Regarding provenance and priority, I think PTLQueue or queues effectively equivalent to PTLQueue have been independently rediscovered a number of times. See CB-Queue and BNPBV, below, for instance. But Wilson's dissertation anticipates the basic idea and seems to predate all the others. Gottlieb et al : Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors Orozco et al : CB-Queue in Toward high-throughput algorithms on many-core architectures which appeared in TACO 2012. Meneghin et al : BNPVB family in Performance evaluation of inter-thread communication mechanisms on multicore/multithreaded architecture Dmitry Vyukov : bounded MPMC queue (highly recommended) Alex Otenko : US8607249 (highly related). John Mellor-Crummey : Concurrent queues: Practical fetch-and-phi algorithms. Technical Report 229, Department of Computer Science, University of Rochester Thomasson : FIFO Distributed Bakery Algorithm (very similar to PTLQueue). Scott and Scherer : Dual Data Structures I'll propose an optimization left as an exercise for the reader. Say we wanted to reduce memory usage by eliminating inter-slot padding. Such padding is usually "dark" memory and otherwise unused and wasted. But eliminating the padding leaves us at risk of increased false sharing. Furthermore lets say it was usually the case that the PutCursor and TakeCursor were numerically close to each other. (That's true in some use cases). We might still reduce false sharing by incrementing the cursors by some value other than 1 that is not trivially small and is coprime with the number of slots. Alternatively, we might increment the cursor by one and mask as usual, resulting in a logical index. We then use that logical index value to index into a permutation table, yielding an effective index for use in the slot array. The permutation table would be constructed so that nearby logical indices would map to more distant effective indices. (Open question: what should that permutation look like? Possibly some perversion of a Gray code or De Bruijn sequence might be suitable). As an aside, say we need to busy-wait for some condition as follows : "while C == 0 : Pause". Lets say that C is usually non-zero, so we typically don't wait. But when C happens to be 0 we'll have to spin for some period, possibly brief. We can arrange for the code to be more machine-friendly with respect to the branch predictors by transforming the loop into : "if C == 0 : for { Pause; if C != 0 : break; }". Critically, we want to restructure the loop so there's one branch that controls entry and another that controls loop exit. A concern is that your compiler or JIT might be clever enough to transform this back to "while C == 0 : Pause". You can sometimes avoid this by inserting a call to a some type of very cheap "opaque" method that the compiler can't elide or reorder. On Solaris, for instance, you could use :"if C == 0 : { gethrtime(); for { Pause; if C != 0 : break; }}". It's worth noting the obvious duality between locks and queues. If you have strict FIFO lock implementation with local spinning and succession by direct handoff such as MCS or CLH,then you can usually transform that lock into a queue. Hidden commentary and annotations - invisible : * And of course there's a well-known duality between queues and locks, but I'll leave that topic for another blog post. * Compare and contrast : PTLQ vs PTL and MultiLane * Equivalent : Turn; seq; sequence; pos; position; ticket * Put = Lock; Deposit Take = identify and reserve slot; wait; extract & clear; unlock * conceptualize : Distinct PutLock and TakeLock implemented as ticket lock or PTL Distinct arrival cursors but share per-slot "Turn" variable provides exclusive role-based access to slot's mailbox field put() acquires exclusive access to a slot for purposes of "deposit" assigns slot round-robin and then acquires deposit access rights/perms to that slot take() acquires exclusive access to slot for purposes of "withdrawal" assigns slot round-robin and then acquires withdrawal access rights/perms to that slot At any given time, only one thread can have withdrawal access to a slot at any given time, only one thread can have deposit access to a slot Permissible for T1 to have deposit access and T2 to simultaneously have withdrawal access * round-robin for the purposes of; role-based; access mode; access role mailslot; mailbox; allocate/assign/identify slot rights; permission; license; access permission; * PTL/Ticket hybrid Asymmetric usage ; owner oblivious lock-unlock pairing K-exclusion add Grant cursor pass message m from lock to unlock via Slots[] array Cursor performs 2 functions : + PTL ticket + Assigns request to slot in round-robin fashion Deconstruct protocol : explication put() : allocate slot in round-robin fashion acquire PTL for "put" access store message into slot associated with PTL index take() : Acquire PTL for "take" access // doorway step seq = fetchAdd (&Grant, 1) s = &Slots[seq & Mask] // waiting phase while s-Turn != seq : pause Extract : wait for s-mailbox to be full v = s-mailbox s-mailbox = null Release PTL for both "put" and "take" access s-Turn = seq + Mask + 1 * Slot round-robin assignment and lock "doorway" protocol leverage the same cursor and FetchAdd operation on that cursor FetchAdd (&Cursor,1) + round-robin slot assignment and dispersal + PTL/ticket lock "doorway" step waiting phase is via "Turn" field in slot * PTLQueue uses 2 cursors -- put and take. Acquire "put" access to slot via PTL-like lock Acquire "take" access to slot via PTL-like lock 2 locks : put and take -- at most one thread can access slot's mailbox Both locks use same "turn" field Like multilane : 2 cursors : put and take slot is simple 1-capacity mailbox instead of queue Borrow per-slot turn/grant from PTL Provides strict FIFO Lock slot : put-vs-put take-vs-take at most one put accesses slot at any one time at most one put accesses take at any one time reduction to 1-vs-1 instead of N-vs-M concurrency Per slot locks for put/take Release put/take by advancing turn * is instrumental in ... * P-V Semaphore vs lock vs K-exclusion * See also : FastQueues-excerpt.java dice-etc/queue-mpmc-bounded-blocking-circular-xadd/ * PTLQueue is the same as PTLQB - identical * Expedient return; ASAP; prompt; immediately * Lamport's Bakery algorithm : doorway step then waiting phase Threads arriving at doorway obtain a unique ticket number Threads enter in ticket order * In the terminology of Reed and Kanodia a ticket lock corresponds to the busy-wait implementation of a semaphore using an eventcount and a sequencer It can also be thought of as an optimization of Lamport's bakery lock was designed for fault-tolerance rather than performance Instead of spinning on the release counter, processors using a bakery lock repeatedly examine the tickets of their peers --

Read the article
Windows Azure Recipe: High Performance Computing

- by Clint Edmonson

One of the most attractive ways to use a cloud platform is for parallel processing. Commonly known as high-performance computing (HPC), this approach relies on executing code on many machines at the same time. On Windows Azure, this means running many role instances simultaneously, all working in parallel to solve some problem. Doing this requires some way to schedule applications, which means distributing their work across these instances. To allow this, Windows Azure provides the HPC Scheduler. This service can work with HPC applications built to use the industry-standard Message Passing Interface (MPI). Software that does finite element analysis, such as car crash simulations, is one example of this type of application, and there are many others. The HPC Scheduler can also be used with so-called embarrassingly parallel applications, such as Monte Carlo simulations. Whatever problem is addressed, the value this component provides is the same: It handles the complex problem of scheduling parallel computing work across many Windows Azure worker role instances. Drivers Elastic compute and storage resources Cost avoidance Solution Here’s a sketch of a solution using our Windows Azure HPC SDK: Ingredients Web Role – this hosts a HPC scheduler web portal to allow web based job submission and management. It also exposes an HTTP web service API to allow other tools (including Visual Studio) to post jobs as well. Worker Role – typically multiple worker roles are enlisted, including at least one head node that schedules jobs to be run among the remaining compute nodes. Database – stores state information about the job queue and resource configuration for the solution. Blobs, Tables, Queues, Caching (optional) – many parallel algorithms persist intermediate and/or permanent data as a result of their processing. These fast, highly reliable, parallelizable storage options are all available to all the jobs being processed. Training Here is a link to online Windows Azure training labs where you can learn more about the individual ingredients described above. (Note: The entire Windows Azure Training Kit can also be downloaded for offline use.) Windows Azure HPC Scheduler (3 labs) The Windows Azure HPC Scheduler includes modules and features that enable you to launch and manage high-performance computing (HPC) applications and other parallel workloads within a Windows Azure service. The scheduler supports parallel computational tasks such as parametric sweeps, Message Passing Interface (MPI) processes, and service-oriented architecture (SOA) requests across your computing resources in Windows Azure. With the Windows Azure HPC Scheduler SDK, developers can create Windows Azure deployments that support scalable, compute-intensive, parallel applications. See my Windows Azure Resource Guide for more guidance on how to get started, including links web portals, training kits, samples, and blogs related to Windows Azure.

Read the article
Error java.lang.OutOfMemoryError: getNewTla using Oracle EPM products

- by Marc Schumacher

Running into a Java out of memory error, it is very common behaviour in the field that the Java heap size will be increased. While this might help to solve a heap space out of memory error, it might not help to fix an out of memory error for the Thread Local Area (TLA). Increasing the available heap space from 1 GB to 16 GB might not even help in this situation. The Thread Local Area (TLA) is part of the Java heap, but as the name already indicates, this memory area is local to a specific thread so there is no need to synchronize with other threads using this memory area. For optimization purposes the TLA size is configurable using the Java command line option “-XXtlasize”. Depending on the JRockit version and the available Java heap, the default values vary. Using Oracle EPM System (mainly 11.1.2.x) the following setting was tested successfully: -XXtlasize:min=8k,preferred=128k More information about the “-XXtlasize” parameter can be found in the JRockit documentation: http://docs.oracle.com/cd/E13150_01/jrockit_jvm/jrockit/jrdocs/refman/optionXX.html

Read the article
How To Get Web Site Thumbnail Image In ASP.NET

- by SAMIR BHOGAYTA

Overview One very common requirement of many web applications is to display a thumbnail image of a web site. A typical example is to provide a link to a dynamic website displaying its current thumbnail image, or displaying images of websites with their links as a result of search (I love to see it on Google). Microsoft .NET Framework 2.0 makes it quite easier to do it in a ASP.NET application. Background In order to generate image of a web page, first we need to load the web page to get their html code, and then this html needs to be rendered in a web browser. After that, a screen shot can be taken easily. I think there is no easier way to do this. Before .NET framework 2.0 it was quite difficult to use a web browser in C# or VB.NET because we either have to use COM+ interoperability or third party controls which becomes headache later. WebBrowser control in .NET framework 2.0 In .NET framework 2.0 we have a new Windows Forms WebBrowser control which is a wrapper around old shwdoc.dll. All you really need to do is to drop a WebBrowser control from your Toolbox on your form in .NET framework 2.0. If you have not used WebBrowser control yet, it's quite easy to use and very consistent with other Windows Forms controls. Some important methods of WebBrowser control are. public bool GoBack(); public bool GoForward(); public void GoHome(); public void GoSearch(); public void Navigate(Uri url); public void DrawToBitmap(Bitmap bitmap, Rectangle targetBounds); These methods are self explanatory with their names like Navigate function which redirects browser to provided URL. It also has a number of useful overloads. The DrawToBitmap (inherited from Control) draws the current image of WebBrowser to the provided bitmap. Using WebBrowser control in ASP.NET 2.0 The Solution Let's start to implement the solution which we discussed above. First we will define a static method to get the web site thumbnail image. public static Bitmap GetWebSiteThumbnail(string Url, int BrowserWidth, int BrowserHeight, int ThumbnailWidth, int ThumbnailHeight) { WebsiteThumbnailImage thumbnailGenerator = new WebsiteThumbnailImage(Url, BrowserWidth, BrowserHeight, ThumbnailWidth, ThumbnailHeight); return thumbnailGenerator.GenerateWebSiteThumbnailImage(); } The WebsiteThumbnailImage class will have a public method named GenerateWebSiteThumbnailImage which will generate the website thumbnail image in a separate STA thread and wait for the thread to exit. In this case, I decided to Join method of Thread class to block the initial calling thread until the bitmap is actually available, and then return the generated web site thumbnail. public Bitmap GenerateWebSiteThumbnailImage() { Thread m_thread = new Thread(new ThreadStart(_GenerateWebSiteThumbnailImage)); m_thread.SetApartmentState(ApartmentState.STA); m_thread.Start(); m_thread.Join(); return m_Bitmap; } The _GenerateWebSiteThumbnailImage will create a WebBrowser control object and navigate to the provided Url. We also register for the DocumentCompleted event of the web browser control to take screen shot of the web page. To pass the flow to the other controls we need to perform a method call to Application.DoEvents(); and wait for the completion of the navigation until the browser state changes to Complete in a loop. private void _GenerateWebSiteThumbnailImage() { WebBrowser m_WebBrowser = new WebBrowser(); m_WebBrowser.ScrollBarsEnabled = false; m_WebBrowser.Navigate(m_Url); m_WebBrowser.DocumentCompleted += new WebBrowserDocument CompletedEventHandler(WebBrowser_DocumentCompleted); while (m_WebBrowser.ReadyState != WebBrowserReadyState.Complete) Application.DoEvents(); m_WebBrowser.Dispose(); } The DocumentCompleted event will be fired when the navigation is completed and the browser is ready for screen shot. We will get screen shot using DrawToBitmap method as described previously which will return the bitmap of the web browser. Then the thumbnail image is generated using GetThumbnailImage method of Bitmap class passing it the required thumbnail image width and height. private void WebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e) { WebBrowser m_WebBrowser = (WebBrowser)sender; m_WebBrowser.ClientSize = new Size(this.m_BrowserWidth, this.m_BrowserHeight); m_WebBrowser.ScrollBarsEnabled = false; m_Bitmap = new Bitmap(m_WebBrowser.Bounds.Width, m_WebBrowser.Bounds.Height); m_WebBrowser.BringToFront(); m_WebBrowser.DrawToBitmap(m_Bitmap, m_WebBrowser.Bounds); m_Bitmap = (Bitmap)m_Bitmap.GetThumbnailImage(m_ThumbnailWidth, m_ThumbnailHeight, null, IntPtr.Zero); } One more example here : http://www.codeproject.com/KB/aspnet/Website_URL_Screenshot.aspx

Read the article
Polite busy-waiting with WRPAUSE on SPARC

- by Dave

Unbounded busy-waiting is an poor idea for user-space code, so we typically use spin-then-block strategies when, say, waiting for a lock to be released or some other event. If we're going to spin, even briefly, then we'd prefer to do so in a manner that minimizes performance degradation for other sibling logical processors ("strands") that share compute resources. We want to spin politely and refrain from impeding the progress and performance of other threads — ostensibly doing useful work and making progress — that run on the same core. On a SPARC T4, for instance, 8 strands will share a core, and that core has its own L1 cache and 2 pipelines. On x86 we have the PAUSE instruction, which, naively, can be thought of as a hardware "yield" operator which temporarily surrenders compute resources to threads on sibling strands. Of course this helps avoid intra-core performance interference. On the SPARC T2 our preferred busy-waiting idiom was "RD %CCR,%G0" which is a high-latency no-nop. The T4 provides a dedicated and extremely useful WRPAUSE instruction. The processor architecture manuals are the authoritative source, but briefly, WRPAUSE writes a cycle count into the the PAUSE register, which is ASR27. Barring interrupts, the processor then delays for the requested period. There's no need for the operating system to save the PAUSE register over context switches as it always resets to 0 on traps. Digressing briefly, if you use unbounded spinning then ultimately the kernel will preempt and deschedule your thread if there are other ready threads than are starving. But by using a spin-then-block strategy we can allow other ready threads to run without resorting to involuntary time-slicing, which operates on a long-ish time scale. Generally, that makes your application more responsive. In addition, by blocking voluntarily we give the operating system far more latitude regarding power management. Finally, I should note that while we have OS-level facilities like sched_yield() at our disposal, yielding almost never does what you'd want or naively expect. Returning to WRPAUSE, it's natural to ask how well it works. To help answer that question I wrote a very simple C/pthreads benchmark that launches 8 concurrent threads and binds those threads to processors 0..7. The processors are numbered geographically on the T4, so those threads will all be running on just one core. Unlike the SPARC T2, where logical CPUs 0,1,2 and 3 were assigned to the first pipeline, and CPUs 4,5,6 and 7 were assigned to the 2nd, there's no fixed mapping between CPUs and pipelines in the T4. And in some circumstances when the other 7 logical processors are idling quietly, it's possible for the remaining logical processor to leverage both pipelines. Some number T of the threads will iterate in a tight loop advancing a simple Marsaglia xor-shift pseudo-random number generator. T is a command-line argument. The main thread loops, reporting the aggregate number of PRNG steps performed collectively by those T threads in the last 10 second measurement interval. The other threads (there are 8-T of these) run in a loop busy-waiting concurrently with the T threads. We vary T between 1 and 8 threads, and report on various busy-waiting idioms. The values in the table are the aggregate number of PRNG steps completed by the set of T threads. The unit is millions of iterations per 10 seconds. For the "PRNG step" busy-waiting mode, the busy-waiting threads execute exactly the same code as the T worker threads. We can easily compute the average rate of progress for individual worker threads by dividing the aggregate score by the number of worker threads T. I should note that the PRNG steps are extremely cycle-heavy and access almost no memory, so arguably this microbenchmark is not as representative of "normal" code as it could be. And for the purposes of comparison I included a row in the table that reflects a waiting policy where the waiting threads call poll(NULL,0,1000) and block in the kernel. Obviously this isn't busy-waiting, but the data is interesting for reference. _table { border:2px black dotted; margin: auto; width: auto; } _tr { border: 2px red dashed; } _td { border: 1px green solid; } _table { border:2px black dotted; margin: auto; width: auto; } _tr { border: 2px red dashed; } td { background-color : #E0E0E0 ; text-align : right ; } th { text-align : left ; } td { background-color : #E0E0E0 ; text-align : right ; } th { text-align : left ; } Aggregate progress T = #worker threads Wait Mechanism for 8-T threadsT=1T=2T=3T=4T=5T=6T=7T=8 Park thread in poll() 32653347334833483348334833483348 no-op 415 831 124316482060249729303349 RD %ccr,%g0 "pause" 14262429269228623013316232553349 PRNG step 412 829 124616702092251029303348 WRPause(8000) 32443361333133483349334833483348 WRPause(4000) 32153308331533223347334833473348 WRPause(1000) 30853199322432513310334833483348 WRPause(500) 29173070315032223270330933483348 WRPause(250) 26942864294930773205338833483348 WRPause(100) 21552469262227902911321433303348

Read the article
Actor based concurrency and cancellation

- by Akash

I'm reading about actor based concurrency and I appreciate the simplicity of actors sequentially processing messages on a single thread. However there is one scenario that doesn't seen possible. Suppose that actor A sends a message to actor B, who then performs some long running task and returns a completion message to actor A. How can actor A force actor B to cancel the long running task after it has started? If actor B is running the task in its message queue thread, it won't pick up the cancellation message until it had completed the task; if actor B runs the task in a background thread then it seems to be violating the principle of actors. Is there a common way that this scenario is handled with actors? Or does each actor language/framework take a different approach? Or is this not a suitable problem to tackle via actors?

Read the article
Rewrite Generic URLs into real URLs on Google Analytics

- by valdroni

I have an iPhone app for a forum which also has a limited Google Analytics reporting. This app reports the page views in following generic form: /forum/67 /thread/29036 etc... The numbers above represent forum and thread ID's I am trying to set an Advanced filter, which will rewrite/report the page views in Google Analytics in following form: http://www.mysite.com/forum-67.html http://www.mysite.com/thread-29036.html Can someone please assist me in creating an Advanced Google Analytics filter which will enable me to see URL's so they can be live and send to correct page. Is there another method to achieve what I'm looking for ? Obviously there will be a need for some RegExp matches, but I cannot get around it.

Read the article
One True Event Loop

- by CyberShadow

Simple programs that collect data from only one system need only one event loop. For example, Windows applications have the message loop, POSIX network programs usually have a select/epoll/etc. loop at their core, pure SDL games use SDL's event loop. But what if you need to collect events from several subsystems? Such as an SDL game which doesn't use SDL_net for networking. I can think of several solutions: Polling (ugh) Put each event loop in its own thread, and: Send messages to the main thread, which collects and processes the events, or Place the event-processing code of each thread in a critical section, so that the threads can wait for events asynchronously but process them synchronously Choose one subsystem for the main event loop, and pass events from other subsystems via that subsystem as custom messages (for example, the Windows message loop and custom messages, or a socket select() loop and passing events via a loopback connection). Option 2.1 is more interesting on platforms where message-passing is a well-developed threading primitive (e.g. in the D programming language), but 2.2 looks like the best option to me.

Read the article
What's the difference between 'killall' and 'pkill'?

- by jgbelacqua

After using just plain kill <some_pid> on Unix systems for many years, I learned pkill from a younger Linux-savvy co-worker colleague1. I soon accepted the Linux-way, pgrep-ing and pkill-ing through many days and nights, through slow-downs and race conditions. This was all well and good. But now I see nothing but killall . How-to's seem to only mention killall, and I'm not sure if this is some kind of parallel development, or if killall is a successor to pkill, or something else. It seems to function as more targeted pkill, but I'm sure I'm missing something. Can an Ubuntu/Debian-savvy person explain when (or why) killall should be used, especially if it should be used in preference to pkill (when pkill often seems easier, because I can be sloppier with name matching, at least by default). 1 'colleague' is free upgrade from 'co-worker', so might as well.

Read the article
What is So Unique About Node.js?

- by Adrian Shum

Recently there has been a lot of praise for Node.js. I am not a developer that has had much exposure to network application. From my bare understanding of Nodes.js, its strength is: we have only one thread handling multiple connections, providing an event-based architecture. However, for example in Java, I can create only one thread using NIO/AIO (which is non-blocking APIs from my bare understanding), and handle multiple connections using that thread, and I provide an event-based architecture to implement the data handling logic (shouldn't be that difficult by providing some callback etc) ? Given JVM being a even more mature VM than V8 (I expect it to run faster too), and event-based handling architecture seems to be something not difficult to create, I am not sure why Node.js is attracting so much attention. Did I miss some important points?

Read the article
Ubuntu 12.04 x64 LTS VPN Server not changing IP

- by user288778

I used this guide http://silverlinux.blogspot.co.uk/2012/05/how-to-pptp-vpn-on-ubuntu-1204-pptpd.html and it worked fine. I'm able to connect but the problem is, that my IP being changed to "localip" not "remote ip". This is what I get from tail -f /var/log/syslog [code] June 6 00:09:19 instant5860 NetworkManager[1456]: Unmanaged Device found; state CONNECTED forced (see http://bugs.launchpad.net/bugs/191889) June 6 00:09:19 instant5860 NetworkManager[1456]: Marking connection 'Wired connection 1' invalid. June 6 00:09:19 instant5860 NetworkManager[1456]: Activation (eth1) failed. June 6 00:09:19 instant5860 NetworkManager[1456]: Activation (eth1) Stage 4 of 5 (IPv4 Configure Timeout) complete. June 6 00:09:19 instant5860 NetworkManager[1456]: (eth1): device state change: failed - disconnected (reason 'none') [120 30 0] June 6 00:09:19 instant5860 NetworkManager[1456]: (eth1): deactivating device (reason 'none') [0] June 6 00:09:19 instant5860 NetworkManager[1456]: Unmanaged Device found; state CONNECTED forced. June----- avahi-daemon[440]: Withdrawing address record for fe80......... on eth1 Jun------avahi-daemon[440]: Leaving mDNS multicast group on interface eth1. IPv6 with address fe80..... Jun------avahi-daemon[440]: Interface eth1.IPv6 no longer relevant for mDNS. Jun------avahi-daemon[440]: Joining mDNS multicast group on interface eth1.IPv6 with address fe80.... Jun------avahi-daemon[440]: New relevant interface eth1.IPv6 for mDNS Jun------avahi-daemon[440]: Registering new address record for fe80..... on eth1.*. Jun - snmpd[1172]: error on subcontainer 'ia_addr' insert (-1) dbusp382]: [syste] Activating service name='org.freedesktop.PackageKit' (using servicehelper) AptDaemon: INFO: Initializing daemon AptDaemon.PackageKit: INFO: Initializing PackageKit compat layer dbus[382]: [system] Successfu;;y activated service 'org.freedesktop.PackageKit' AptDaemon.PackageKit: INFO: Initializing PackageKit transaction AptDaemon.Worker: INFO: Simulating trans: /org/debian/apt/transaction/233beca013a0473ea34d9dea805af5df AptDaemon.Worker: INFO: Processing transaction /org/debian/apt... AptDaemon.PackageKit: INFO: Get updates() AptDaemon.Worker: INFO: Finished snmpd[1172]: error on subcontainer pptpd[23611]: CTRL: Client 82.33.... control connection started pptpd[23611]: CTRL: Starting call (launching pppd, opening GRE) pptpd[23611]: pppd 2.4.5 started by root uid 0 pptpd[23611]: Using interface ppp0 pptpd[23611]: Connect ppp0 <-- /dev/pts/1 NetworkManager[1456]: SCPlugin - Ifupdown: device added (path: /sys/devices/virtual/net/ppp0, iface: ppp0) NetworkManager[1456]:SCPlugin - Ifupdown: device added (path: /sys/devices/virtual/net/ppp0, iface: ppp0): no ifupdown configuration found. pptpd[23612]: peer from calling number 82... authorized. kernel: [2918261.416923] init: ufw pre-start process (23613) terminated with status 1 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 7 CTRL: Ignored a SET LING info packet with real ACCMs! local IP address:109.0.121.197 remote IP address: 109.0.84.56 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 13 NetworkManager[1456]: (eth1): DHCPv4 request timed out. NetworkManager[1456]: (eth1): canceled DHCP transaction, DHCP client pid 23280 NetworkManager[1456]: Activation (eth1) Stage 4 of 5 (IPv4 Configure Timeout) scheduled... NetworkManager[1456]: Activation (eth1) Stage 4 of 5 (IPv4 Configure Timeout) started... NetworkManager[1456]: (eth1): device state change: ip-config - failed (reason 'ip-config-unavailable') [70 120 5[ NetworkManager[1456]: Unmanaged 'ia_addr' insert (-1)[/code]

Read the article
Why Nodes.js being that "unique"?

- by Adrian Shum

Recently years there are lots of praise to Nodes.js. I am not a developer that have much exposure on network application. From my bare understanding of Nodes.js, its strength is: We are having only on thread handling multiple connections, providing a event-based architecture. However, for example in Java, what if I am having only one thread, using NIO/AIO (which is non-blocking APIs from my bare understanding), and handle multiple connections using that thread, and I provide an event-based architecture to implement the data handling logic (shouldn't be that difficult by providing some callback etc) ? Given JVM being a even more mature VM than V8 (I expect it run faster too), and event-based handling architecture seems not something difficult to create. I am not sure why Nodes.js is attracting so much attention. Did I miss some important points?

Read the article
Windows Azure RoleEntryPoint Method Call Order

- by kaleidoscope

Worker Role Call Order: WaWorkerHost process is started. Worker Role assembly is loaded and surfed for a class that derives from RoleEntryPoint. This class is instantiated. RoleEntryPoint.OnStart() is called. RoleEntryPoint.Run() is called. If the RoleEntryPoint.Run() method exits, the RoleEntryPoint.OnStop() method is called . WaWorkerHost process is stopped. The role will recycle and startup again. Web Role Call Order: WaWebHost process is started. Hostable Web Core is activated. Web role assembly is loaded and RoleEntryPoint.OnStart() is called. Global.Application_Start() is called. The web application runs Global.Application_End() is called. RoleEntryPoint.OnStop() is called. Hostable Web Core is deactivated. WaWebHost process is stopped. For Further Reference: http://blogs.msdn.com/jnak/archive/2010/02/11/windows-azure-roleentrypoint-method-call-order.aspx Tinu, O

Read the article
How to avoid “The web server process that was being debugged has been terminated by IIS”

- by ybbest

Problem: When debugging asp.net by attaching w3wp.exe process, you often will see encounter the following error message, the web server process that was being debugged has been terminated by IIS. Analysis: This caused Internet Information Services (IIS) to assume that the worker process had stopped responding. Therefore, IIS terminated the worker process. Solution: 1. Open IIS manager. 2.Click application Pools>>select the application pool associated with the site>>and click Advanced Settings 3. Click Advanced Settings of the application pool and set the Ping Enabled property from True to False. Now, reattach the process from Visual Studio, you should not get the error message. References: msdn

Read the article
Apt-Daemon problem due to a broken sun-java6-jre package

- by Marv

I am having problems with installation with everything in the software center. Traceback (most recent call last): File "/usr/lib/python2.7/dist-packages/aptdaemon/worker.py", line 968, in simulate trans.unauthenticated = self._simulate_helper(trans) File "/usr/lib/python2.7/dist-packages/aptdaemon/worker.py", line 1092, in _simulate_helper return depends, self._cache.required_download, \ File "/usr/lib/python2.7/dist-packages/apt/cache.py", line 235, in required_download pm.get_archives(fetcher, self._list, self._records) SystemError: E:I wasn't able to locate a file for the sun-java6-jre package. This might mean you need to manually fix this package. Help?

Read the article
Threading iPhone

- by bobobobo

Say I have a group of large meshes that I have to intersect rays against. Assume also, for whatever reason, I cannot further simplify/reduce poly check count by spatial subdivisioning. I can do this in parallel: bool intersects( list of meshes ) // a mesh is a group of triangles { create n threads foreach mesh in meshes assign to a thread in threads wait until ( threads.run() ) ; // run asynchronously // when they're all done // pull out intersected triangles // from per-thread context data } Can you do this in ios for games? Or is the overhead of thread creation and mutex waiting going to beat-out the benefit of multithreading?

Read the article
New Endpoint options that enable additional application patterns

- by kaleidoscope

The two communication-related capabilities: a) inter-role communication and b) external endpoints on worker roles enable new application patterns in Windows Azure-hosted services. Inter-role Communication - A common application pattern enabled by this is client-server, where the server could be an application such as a database or a memory cache. External Endpoints on Worker Roles - A common application type enabled by this is a self-hosted Internet-exposed service, such as a custom application server. For further details click on the following link: http://blogs.msdn.com/windowsazure/archive/2009/11/24/new-endpoint-options-enable-additional-application-patterns.aspx Tinu, O

Read the article
DIY Touch Screen Mod Makes Regular Gloves Smartphone-friendly

- by Jason Fitzpatrick

Smartphone-friendly winter gloves are expensive (and often ugly). Skip shelling out for store-bought gloves when, armed with a needle and thread, you can turn any gloves into smartphone-friendly ones. Over at Popular Science, Taylor Kubota shares the simple trick: 1. Order silver-plated nylon thread (silver conducts electricity). This can be difficult to find in stores, but major online retailers carry it. 2. Pick a pair of gloves to modify. Although leather works, it’s harder to push a needle through. 3. Stitch the figure of a star or other solid shape onto the glove’s index finger with the thread, making sure it will contact both the touchscreen and your skin. Our Geek Trivia App for Windows 8 is Now Available Everywhere How To Boot Your Android Phone or Tablet Into Safe Mode HTG Explains: Does Your Android Phone Need an Antivirus?

Read the article
Is it safe to run multiple XNA ContentManager instances on multiple threads?

- by Boinst

My XNA project currently uses one ContentManager instance, and one dedicated background thread for loading all content. I wonder, would it be safe to have multiple ContentManager instances, each in it's own dedicated thread, loading different content at the same time? I'm prompted to ask this question because this article makes the following statement: If there are two textures created at the same time on different threads, they will clobber the other and you will end up with some garbage in the textures. I think that what the author is saying here, is that if I access one ContentManager simultaneously on two threads, I'll get garbage. But what if I have separate ContentManager instances for each thread? If no-one knows the answer already from experience, I'll go ahead and try it and see what happens.

Read the article
Programming error in 'aptdaemon' [closed]

- by Real

Using Ubuntu 11.10 While performing updates through the update manager I get the following message: An unhandlable error occured There seems to be a programming error in aptdaemon, the software that allows you to install/remove software and to perform other package management related tasks. Details Traceback (most recent call last): File "/usr/lib/python2.7/dist-packages/aptdaemon/worker.py", line 968, in simulate trans.unauthenticated = self._simulate_helper(trans) File "/usr/lib/python2.7/dist-packages/aptdaemon/worker.py", line 1092, in _simulate_helper return depends, self._cache.required_download, \ File "/usr/lib/python2.7/dist-packages/apt/cache.py", line 235, in required_download pm.get_archives(fetcher, self._list, self._records) SystemError: E:Method has died unexpectedly!, E:Sub-process returned an error code (100), E:Method /usr/lib/apt/methods/ did not start correctly Tried some of the fixes that were posted but did not work. What shall I do to fix this issue?

Read the article
3 threads printing numbers of different range

- by user875036

This question was asked in an Electronic Arts interview: There are 3 threads. The first thread prints the numbers 1 to 10. The second thread prints the numbers 11 to 20. The third thread prints the numbers from from 21 to 30. Now, all three threads are running. The numbers are printed in an irregular order like 1, 11, 2, 21, 12 etc. If I want numbers to be printed in sorted order like 1, 2, 3, 4..., what should I do with these threads?

Read the article
i cant download things from software center

- by mark

i keep getting this error when i want to get an app from software crnter File "/usr/lib/python2.7/dist-packages/aptdaemon/worker.py", line 972, in simulate trans.unauthenticated = self._simulate_helper(trans) File "/usr/lib/python2.7/dist-packages/aptdaemon/worker.py", line 1096, in _simulate_helper return depends, self._cache.required_download, \ File "/usr/lib/python2.7/dist-packages/apt/cache.py", line 235, in required_download pm.get_archives(fetcher, self._list, self._records) SystemError: E:I wasn't able to locate a file for the sun-java6-jre package. This might mean you need to manually fix this package. any one help please!!!!!!!! how do i manually fix!

Read the article
Do games use threads?

- by Nubcake

I understand that the concept of how a game runs i.e while (game_loop = true) { //handle events // input/output/sound etc } But it has come to my attention while programming in another HLL is do some games use threads for certain operations? For example take any Pokemon game ; during interaction a textbox appears to display information. Now I've been trying to simulate that sort of textbox and the only way I could have got it to be exactly the same is by using a loop and yes once a loop is started there is no way to handle window events unless they are handled again inside the loop itself. I couldn't have used this loop inside a different thread other than the main one (due to a DirectX limitation) so the only option was to use it inside the main program thread. I was wondering if some games work like this ; do they only use the main program thread and handle events again if they're inside a loop? Edit: I forgot to mention this is about console games not PC games! Thanks Nubcake

Read the article
Dynamically loading modules in Python (+ multi processing question)

- by morpheous

I am writing a Python package which reads the list of modules (along with ancillary data) from a configuration file. I then want to iterate through each of the dynamically loaded modules and invoke a do_work() function in it which will spawn a new process, so that the code runs ASYNCHRONOUSLY in a separate process. At the moment, I am importing the list of all known modules at the beginning of my main script - this is a nasty hack I feel, and is not very flexible, as well as being a maintenance pain. This is the function that spawns the processes. I will like to modify it to dynamically load the module when it is encountered. The key in the dictionary is the name of the module containing the code: def do_work(work_info): for (worker, dataset) in work_info.items(): #import the module defined by variable worker here... # [Edit] NOT using threads anymore, want to spawn processes asynchronously here... #t = threading.Thread(target=worker.do_work, args=[dataset]) # I'll NOT dameonize since spawned children need to clean up on shutdown # Since the threads will be holding resources #t.daemon = True #t.start() Question 1 When I call the function in my script (as written above), I get the following error: AttributeError: 'str' object has no attribute 'do_work' Which makes sense, since the dictionary key is a string (name of the module to be imported). When I add the statement: import worker before spawning the thread, I get the error: ImportError: No module named worker This is strange, since the variable name rather than the value it holds are being used - when I print the variable, I get the value (as I expect) whats going on? Question 2 As I mentioned in the comments section, I realize that the do_work() function written in the spawned children needs to cleanup after itself. My understanding is to write a clean_up function that is called when do_work() has completed successfully, or an unhandled exception is caught - is there anything more I need to do to ensure resources don't leak or leave the OS in an unstable state? Question 3 If I comment out the t.daemon flag statement, will the code stil run ASYNCHRONOUSLY?. The work carried out by the spawned children are pretty intensive, and I don't want to have to be waiting for one child to finish before spawning another child. BTW, I am aware that threading in Python is in reality, a kind of time sharing/slicing - thats ok Lastly is there a better (more Pythonic) way of doing what I'm trying to do? [Edit] After reading a little more about Pythons GIL and the threading (ahem - hack) in Python, I think its best to use separate processes instead (at least IIUC, the script can take advantage of multiple processes if they are available), so I will be spawning new processes instead of threads. I have some sample code for spawning processes, but it is a bit trivial (using lambad functions). I would like to know how to expand it, so that it can deal with running functions in a loaded module (like I am doing above). This is a snippet of what I have: def do_mp_bench(): q = mp.Queue() # Not only thread safe, but "process safe" p1 = mp.Process(target=lambda: q.put(sum(range(10000000)))) p2 = mp.Process(target=lambda: q.put(sum(range(10000000)))) p1.start() p2.start() r1 = q.get() r2 = q.get() return r1 + r2 How may I modify this to process a dictionary of modules and run a do_work() function in each loaded module in a new process?

Read the article
C#/.NET Little Wonders: Interlocked CompareExchange()

- by James Michael Hare

Once again, in this series of posts I look at the parts of the .NET Framework that may seem trivial, but can help improve your code by making it easier to write and maintain. The index of all my past little wonders posts can be found here. Two posts ago, I discussed the Interlocked Add(), Increment(), and Decrement() methods (here) for adding and subtracting values in a thread-safe, lightweight manner. Then, last post I talked about the Interlocked Read() and Exchange() methods (here) for safely and efficiently reading and setting 32 or 64 bit values (or references). This week, we’ll round out the discussion by talking about the Interlocked CompareExchange() method and how it can be put to use to exchange a value if the current value is what you expected it to be. Dirty reads can lead to bad results Many of the uses of Interlocked that we’ve explored so far have centered around either reading, setting, or adding values. But what happens if you want to do something more complex such as setting a value based on the previous value in some manner? Perhaps you were creating an application that reads a current balance, applies a deposit, and then saves the new modified balance, where of course you’d want that to happen atomically. If you read the balance, then go to save the new balance and between that time the previous balance has already changed, you’ll have an issue! Think about it, if we read the current balance as $400, and we are applying a new deposit of $50.75, but meanwhile someone else deposits $200 and sets the total to $600, but then we write a total of $450.75 we’ve lost $200! Now, certainly for int and long values we can use Interlocked.Add() to handles these cases, and it works well for that. But what if we want to work with doubles, for example? Let’s say we wanted to add the numbers from 0 to 99,999 in parallel. We could do this by spawning several parallel tasks to continuously add to a total: 1: double total = 0; 2: 3: Parallel.For(0, 10000, next => 4: { 5: total += next; 6: }); Were this run on one thread using a standard for loop, we’d expect an answer of 4,999,950,000 (the sum of all numbers from 0 to 99,999). But when we run this in parallel as written above, we’ll likely get something far off. The result of one of my runs, for example, was 1,281,880,740. That is way off! If this were banking software we’d be in big trouble with our clients. So what happened? The += operator is not atomic, it will read in the current value, add the result, then store it back into the total. At any point in all of this another thread could read a “dirty” current total and accidentally “skip” our add. So, to clean this up, we could use a lock to guarantee concurrency: 1: double total = 0.0; 2: object locker = new object(); 3: 4: Parallel.For(0, count, next => 5: { 6: lock (locker) 7: { 8: total += next; 9: } 10: }); Which will give us the correct result of 4,999,950,000. One thing to note is that locking can be heavy, especially if the operation being locked over is trivial, or the life of the lock is a high percentage of the work being performed concurrently. In the case above, the lock consumes pretty much all of the time of each parallel task – and the task being locked on is relatively trivial. Now, let me put in a disclaimer here before we go further: For most uses, lock is more than sufficient for your needs, and is often the simplest solution! So, if lock is sufficient for most needs, why would we ever consider another solution? The problem with locking is that it can suspend execution of your thread while it waits for the signal that the lock is free. Moreover, if the operation being locked over is trivial, the lock can add a very high level of overhead. This is why things like Interlocked.Increment() perform so well, instead of locking just to perform an increment, we perform the increment with an atomic, lockless method. As with all things performance related, it’s important to profile before jumping to the conclusion that you should optimize everything in your path. If your profiling shows that locking is causing a high level of waiting in your application, then it’s time to consider lighter alternatives such as Interlocked. CompareExchange() – Exchange existing value if equal some value So let’s look at how we could use CompareExchange() to solve our problem above. The general syntax of CompareExchange() is: T CompareExchange<T>(ref T location, T newValue, T expectedValue) If the value in location == expectedValue, then newValue is exchanged. Either way, the value in location (before exchange) is returned. Actually, CompareExchange() is not one method, but a family of overloaded methods that can take int, long, float, double, pointers, or references. It cannot take other value types (that is, can’t CompareExchange() two DateTime instances directly). Also keep in mind that the version that takes any reference type (the generic overload) only checks for reference equality, it does not call any overridden Equals(). So how does this help us? Well, we can grab the current total, and exchange the new value if total hasn’t changed. This would look like this: 1: // grab the snapshot 2: double current = total; 3: 4: // if the total hasn’t changed since I grabbed the snapshot, then 5: // set it to the new total 6: Interlocked.CompareExchange(ref total, current + next, current); So what the code above says is: if the amount in total (1st arg) is the same as the amount in current (3rd arg), then set total to current + next (2nd arg). This check and exchange pair is atomic (and thus thread-safe). This works if total is the same as our snapshot in current, but the problem, is what happens if they aren’t the same? Well, we know that in either case we will get the previous value of total (before the exchange), back as a result. Thus, we can test this against our snapshot to see if it was the value we expected: 1: // if the value returned is != current, then our snapshot must be out of date 2: // which means we didn't (and shouldn't) apply current + next 3: if (Interlocked.CompareExchange(ref total, current + next, current) != current) 4: { 5: // ooops, total was not equal to our snapshot in current, what should we do??? 6: } So what do we do if we fail? That’s up to you and the problem you are trying to solve. It’s possible you would decide to abort the whole transaction, or perhaps do a lightweight spin and try again. Let’s try that: 1: double current = total; 2: 3: // make first attempt... 4: if (Interlocked.CompareExchange(ref total, current + i, current) != current) 5: { 6: // if we fail, go into a spin wait, spin, and try again until succeed 7: var spinner = new SpinWait(); 8: 9: do 10: { 11: spinner.SpinOnce(); 12: current = total; 13: } 14: while (Interlocked.CompareExchange(ref total, current + i, current) != current); 15: } 16: This is not trivial code, but it illustrates a possible use of CompareExchange(). What we are doing is first checking to see if we succeed on the first try, and if so great! If not, we create a SpinWait and then repeat the process of SpinOnce(), grab a fresh snapshot, and repeat until CompareExchnage() succeeds. You may wonder why not a simple do-while here, and the reason it’s more efficient to only create the SpinWait until we absolutely know we need one, for optimal efficiency. Though not as simple (or maintainable) as a simple lock, this will perform better in many situations. Comparing an unlocked (and wrong) version, a version using lock, and the Interlocked of the code, we get the following average times for multiple iterations of adding the sum of 100,000 numbers: 1: Unlocked money average time: 2.1 ms 2: Locked money average time: 5.1 ms 3: Interlocked money average time: 3 ms So the Interlocked.CompareExchange(), while heavier to code, came in lighter than the lock, offering a good compromise of safety and performance when we need to reduce contention. CompareExchange() - it’s not just for adding stuff… So that was one simple use of CompareExchange() in the context of adding double values -- which meant we couldn’t have used the simpler Interlocked.Add() -- but it has other uses as well. If you think about it, this really works anytime you want to create something new based on a current value without using a full lock. For example, you could use it to create a simple lazy instantiation implementation. In this case, we want to set the lazy instance only if the previous value was null: 1: public static class Lazy<T> where T : class, new() 2: { 3: private static T _instance; 4: 5: public static T Instance 6: { 7: get 8: { 9: // if current is null, we need to create new instance 10: if (_instance == null) 11: { 12: // attempt create, it will only set if previous was null 13: Interlocked.CompareExchange(ref _instance, new T(), (T)null); 14: } 15: 16: return _instance; 17: } 18: } 19: } So, if _instance == null, this will create a new T() and attempt to exchange it with _instance. If _instance is not null, then it does nothing and we discard the new T() we created. This is a way to create lazy instances of a type where we are more concerned about locking overhead than creating an accidental duplicate which is not used. In fact, the BCL implementation of Lazy<T> offers a similar thread-safety choice for Publication thread safety, where it will not guarantee only one instance was created, but it will guarantee that all readers get the same instance. Another possible use would be in concurrent collections. Let’s say, for example, that you are creating your own brand new super stack that uses a linked list paradigm and is “lock free”. We could use Interlocked.CompareExchange() to be able to do a lockless Push() which could be more efficient in multi-threaded applications where several threads are pushing and popping on the stack concurrently. Yes, there are already concurrent collections in the BCL (in .NET 4.0 as part of the TPL), but it’s a fun exercise! So let’s assume we have a node like this: 1: public sealed class Node<T> 2: { 3: // the data for this node 4: public T Data { get; set; } 5: 6: // the link to the next instance 7: internal Node<T> Next { get; set; } 8: } Then, perhaps, our stack’s Push() operation might look something like: 1: public sealed class SuperStack<T> 2: { 3: private volatile T _head; 4: 5: public void Push(T value) 6: { 7: var newNode = new Node<int> { Data = value, Next = _head }; 8: 9: if (Interlocked.CompareExchange(ref _head, newNode, newNode.Next) != newNode.Next) 10: { 11: var spinner = new SpinWait(); 12: 13: do 14: { 15: spinner.SpinOnce(); 16: newNode.Next = _head; 17: } 18: while (Interlocked.CompareExchange(ref _head, newNode, newNode.Next) != newNode.Next); 19: } 20: } 21: 22: // ... 23: } Notice a similar paradigm here as with adding our doubles before. What we are doing is creating the new Node with the data to push, and with a Next value being the original node referenced by _head. This will create our stack behavior (LIFO – Last In, First Out). Now, we have to set _head to now refer to the newNode, but we must first make sure it hasn’t changed! So we check to see if _head has the same value we saved in our snapshot as newNode.Next, and if so, we set _head to newNode. This is all done atomically, and the result is _head’s original value, as long as the original value was what we assumed it was with newNode.Next, then we are good and we set it without a lock! If not, we SpinWait and try again. Once again, this is much lighter than locking in highly parallelized code with lots of contention. If I compare the method above with a similar class using lock, I get the following results for pushing 100,000 items: 1: Locked SuperStack average time: 6 ms 2: Interlocked SuperStack average time: 4.5 ms So, once again, we can get more efficient than a lock, though there is the cost of added code complexity. Fortunately for you, most of the concurrent collection you’d ever need are already created for you in the System.Collections.Concurrent (here) namespace – for more information, see my Little Wonders – The Concurent Collections Part 1 (here), Part 2 (here), and Part 3 (here). Summary We’ve seen before how the Interlocked class can be used to safely and efficiently add, increment, decrement, read, and exchange values in a multi-threaded environment. In addition to these, Interlocked CompareExchange() can be used to perform more complex logic without the need of a lock when lock contention is a concern. The added efficiency, though, comes at the cost of more complex code. As such, the standard lock is often sufficient for most thread-safety needs. But if profiling indicates you spend a lot of time waiting for locks, or if you just need a lock for something simple such as an increment, decrement, read, exchange, etc., then consider using the Interlocked class’s methods to reduce wait. Technorati Tags: C#,CSharp,.NET,Little Wonders,Interlocked,CompareExchange,threading,concurrency

Read the article

Search Results

Search found 9137 results on 366 pages for 'worker thread'.

Page 86/366 | < Previous Page | 82 83 84 85 86 87 88 89 90 91 92 93 | Next Page >

- by Dave

- by Clint Edmonson

- by Marc Schumacher

- by SAMIR BHOGAYTA

- by Dave

- by Akash

- by valdroni

- by CyberShadow

- by jgbelacqua

- by Adrian Shum

- by user288778

- by Adrian Shum

- by kaleidoscope

- by ybbest

- by Marv

- by bobobobo

- by kaleidoscope

- by Jason Fitzpatrick

- by Boinst

- by Real

- by user875036

- by mark

- by Nubcake

- by morpheous

- by James Michael Hare

< Previous Page | 82 83 84 85 86 87 88 89 90 91 92 93 | Next Page >