cores - Page 29 - Developer IT

Problem: Vectorizing Code with Intel Visual FORTRAN for X64

- by user313209

I'm compiling my fortran90 code using Intel Visual FORTRAN on Windows Server 2003 Enterprise X64 Edition. When I compile the code for 32 bit structure and using automatic and manual vectorizing options. The code will be compiled, vectorized. And when I run it on 8 core system the compiled code uses 70% of CPU that shows me that vectorizing is working. But when I compile the code with 64 Bit compiler, it says that the code is vectorized but when I run it it only shows CPU usage of about 12% that is full usage for one core out of 8, so it means that while the compiler says that code is vectorized, vectorization is not working. And it's strange for me because it's on a X64 Edition Windows and I was expecting to see the reverse result. I thought that it should be better to run a code that is compiled for 64 Bit architecture on a 64 bit windows. Anyone have any idea why the compiled code is not able to use the full power of multiple cores for 64 Bit Compiled version? Thanks in advance for your responses.

Read the article

Help with Neuroph neural network

- by user359708

For my graduate research I am creating a neural network that trains to recognize images. I am going much more complex than just taking a grid of RGB values, downsampling, and and sending them to the input of the network, like many examples do. I actually use over 100 independently trained neural networks that detect features, such as lines, shading patterns, etc. Much more like the human eye, and it works really well so far! The problem is I have quite a bit of training data. I show it over 100 examples of what a car looks like. Then 100 examples of what a person looks like. Then over 100 of what a dog looks like, etc. This is quite a bit of training data! Currently I am running at about one week to train the network. This is kind of killing my progress, as I need to adjust and retrain. I am using Neuroph, as the low-level neural network API. I am running a dual-quadcore machine(16 cores with hyperthreading), so this should be fast. My processor percent is at only 5%. Are there any tricks on Neuroph performance? Or Java peroformance in general? Suggestions? I am a cognitive psych doctoral student, and I am decent as a programmer, but do not know a great deal about performance programming.

Read the article

Perl Parallel::ForkManager wait_all_children() takes excessively long time

- by zhang18

I have a script that uses Parallel::ForkManager. However, the wait_all_children() process takes incredibly long time even after all child-processes are completed. The way I know is by printing out some timestamps (see below). Does anyone have any idea what might be causing this (I have 16 CPU cores on my machine)? my $pm = Parallel::ForkManager->new(16) for my $i (1..16) { $pm->start($i) and next; ... do something within the child-process ... print (scalar localtime), " Process $i completed.\n"; $pm->finish(); } print (scalar localtime), " Waiting for some child process to finish.\n"; $pm->wait_all_children(); print (scalar localtime), " All processes finished.\n"; Clearly, I'll get the Waiting for some child process to finish message first, with a timestamp of, say, 7:08:35. Then I'll get a list of Process i completed messages, with the last one at 7:10:30. However, I do not receive the message All Processes finished until 7:16:33(!). Why is that 6-minute delay between 7:10:30 and 7:16:33? Thx!

Read the article

C# Multithreaded Domain Design

- by Thijs Cramer

Let's say i have a Domain Model that im trying to make compatible with multithreading. The prototype domain is a game domain that consists of Space, SpaceObject, and Location objects. A SpaceObject has the Move method and Asteroid and Ship extend this object with specific properties for the object (Ship has a name and Asteroid has a color) Let's say i want to make the Move method for each object run in a seperate thread. That would be stupid because with 10000 objects, i would have 10000 threads. What would be the best way to seperate the workload between cores/threads? I'm trying to learn the basics of concurrency, and building a small game to prototype a lot of concepts. What i've already done, is build a domain, and a threading model with a timer that launches events based on intervals. If the event occurs i want to update my entire model with the new locations of any SpaceObject. But i don't know how and when to launch new threads with workloads when the event occurs. Some people at work told me that u can't update your core domain multithreaded, because you have to synch everything. But in that case i can't run my game on a dual quadcore server, because it would only use 1 CPU for the hardest tasks. Anyone know what to do here?

Read the article

How to mult-thread this?

- by WilliamKF

I wish to have two threads. The first thread1 occasionally calls the following pseudo function: void waitForThread2() { if (thread2 is not idle) { return; } notifyThread2IamReady(); while (thread2IsExclusive) { } } The second thread2 is forever in the following pseudo loop: for (;;) { Notify thread1 I am idle. while (!thread1IsReady()) { } Notify thread1 I am exclusive. Do some work while thread1 is blocked. Notify thread1 I am busy. Do some work in parallel with thread1. } What is the best way to write this such that both thread1 and thread2 are kept as busy as possible on a machine with multiple cores. I would like to avoid long delays between notification in one thread and detection by the other. I tried using pthread condition variables but found the delay between thread2 doing 'notify thread1 I am busy' and the loop in waitForThread2() on thear2IsExclusive() can be up to almost one second delay. I then tried using a volatile sig_atomic_t shared variable to control the same, but something is going wrong, so I must not be doing it correctly.

Read the article

What is optimal hardware configuration for heavy load LAMP application

- by Piotr Kochanski

I need to run Linux-Apache-PHP-MySQL application (Moodle e-learning platform) for a large number of concurrent users - I am aiming 5000 users. By concurrent I mean that 5000 people should be able to work with the application at the same time. "Work" means not only do database reads but writes as well. The application is not very typical, since it is doing a lot of inserts/updates on the database, so caching techniques are not helping to much. We are using InnoDB storage engine. In addition application is not written with performance in mind. For instance one Apache thread usually occupies about 30-50 MB of RAM. I would be greatful for information what hardware is needed to build scalable configuration that is able to handle this kind of load. We are using right now two HP DLG 380 with two 4 core processors which are able to handle much lower load (typically 300-500 concurrent users). Is it reasonable to invest in this kind of boxes and build cluster using them or is it better to go with some more high-end hardware? I am particularly curious how many and how powerful servers are needed (number of processors/cores, size of RAM) what network equipment should be used (what kind of switches, network cards) any other hardware, like particular disc storage solutions, etc, that are needed Another thing is how to put together everything, that is what is the most optimal architecture. Clustering with MySQL is rather hard (people are complaining about MySQL Cluster, even here on Stackoverflow).

Read the article

Are mathematical Algorithms protected by copyright?

- by analogy

I wish to implement an algorithm which i read in a journal paper in my software (commercial). I want to know if this is allowed or not. The algorithm in question is described in http://arxiv.org/abs/0709.2938 It is a very simple algorithm and a number of implementations exist in python (http://igraph.sourceforge.net/) and java. One of them is in gpl another which i got from a different researcher and had no license attached. There are significant differences in two implementations, e.g. second one uses threads and multiple cores. It is possible to rewrite/ (not translate) the algorithm. So can I use it in my software or on a server for commercial purpose. Thanks UPDATE: I am completely aware of copyright on the text of paper, it was published in phys rev E. I am concerned with use of the algorithm, in commercial software. Also the publication means that unless the patent has been already filed. The method has been disclosed publicly hence barring patent in future. Also the GPL implementation is not by authors themselves but comes from a third party. Finally i am not using the GPL implementation but creating my own using C++.

Read the article

Replace low level web-service reference call transport with custom one

- by hoodoos

I'm not sure if title sounds right actually, so I will give more explanation here. I will begin from very beginning :) I'm using c# and .net for my development. I have an application that makes requests to some soap web-service and for each user request it produces 3 to 10 requests for web-service, they should all run async to finish in one time, so I use Async method of the web-service generated reference and then wait for result on callback. But it seems like it starts a thread (or takes it from pool) for every async call I make, so if I have 10 clients I got to spawn 30 to 100 threads and it sounds terrible even for my 16 cores server :) So i wanted to replace low level transport implementation with my own which uses non-blocking sockets and can handle at least 50 sockets run parallel in one thread with not much overhead. But I actually dunno where to put my override best. I analyzed System.Web.Services.Protocols.SoapHttpClientProtocol class and see that it has some GetWebRequest method which I actually could use. If only I could somehow interupt the object it creates and get a http request with all headers and body from there and then send it with my own sockets.. Any ideas what approach to use? Or maybe there's something built in the framework I can use?

Read the article

multi-core processing in R on windows XP - via doMC and foreach

- by Jan

Hi guys, I'm posting this question to ask for advice on how to optimize the use of multiple processors from R on a Windows XP machine. At the moment I'm creating 4 scripts (each script with e.g. for (i in 1:100) and (i in 101:200), etc) which I run in 4 different R sessions at the same time. This seems to use all the available cpu. I however would like to do this a bit more efficient. One solution could be to use the "doMC" and the "foreach" package but this is not possible in R on a Windows machine. e.g. library("foreach") library("strucchange") library("doMC") # would this be possible on a windows machine? registerDoMC(2) # for a computer with two cores (processors) ## Nile data with one breakpoint: the annual flows drop in 1898 ## because the first Ashwan dam was built data("Nile") plot(Nile) ## F statistics indicate one breakpoint fs.nile <- Fstats(Nile ~ 1) plot(fs.nile) breakpoints(fs.nile) # , hpc = "foreach" --> It would be great to test this. lines(breakpoints(fs.nile)) Any solutions or advice? Thanks, Jan

Read the article

How do i write tasks? (parallel code)

- by acidzombie24

I am impressed with intel thread building blocks. I like how i should write task and not thread code and i like how it works under the hood with my limited understanding (task are in a pool, there wont be 100 threads on 4cores, a task is not guaranteed to run because it isnt on its own thread and may be far into the pool. But it may be run with another related task so you cant do bad things like typical thread unsafe code). I wanted to know more about writing task. I like the 'Task-based Multithreading - How to Program for 100 cores' video here http://www.gdcvault.com/sponsor.php?sponsor_id=1 (currently second last link. WARNING it isnt 'great'). My fav part was 'solving the maze is better done in parallel' which is around the 48min mark (you can click the link on the left side. That part is really all you need to watch if any). However i like to see more code examples and some API of how to write task. Does anyone have a good resource? I have no idea how a class or pieces of code may look after pushing it onto a pool or how weird code may look when you need to make a copy of everything and how much of everything is pushed onto a pool.

Read the article

How do i write task? (parallel code)

- by acidzombie24

I am impressed with intel thread building blocks. I like how i should write task and not thread code and i like how it works under the hood with my limited understanding (task are in a pool, there wont be 100 threads on 4cores, a task is not guaranteed to run because it isnt on its own thread and may be far into the pool. But it may be run with another related task so you cant do bad things like typical thread unsafe code). I wanted to know more about writing task. I like the 'Task-based Multithreading - How to Program for 100 cores' video here http://www.gdcvault.com/sponsor.php?sponsor_id=1 (currently second last link. WARNING it isnt 'great'). My fav part was 'solving the maze is better done in parallel' which is around the 48min mark (you can click the link on the left side). However i like to see more code examples and some API of how to write task. Does anyone have a good resource? I have no idea how a class or pieces of code may look after pushing it onto a pool or how weird code may look when you need to make a copy of everything and how much of everything is pushed onto a pool.

Read the article

unroll nested for loops in C++

- by Hristo

How would I unroll the following nested loops? for(k = begin; k != end; ++k) { for(j = 0; j < Emax; ++j) { for(i = 0; i < N; ++i) { if (j >= E[i]) continue; array[k] += foo(i, tr[k][i], ex[j][i]); } } } I tried the following, but my output isn't the same, and it should be: for(k = begin; k != end; ++k) { for(j = 0; j < Emax; ++j) { for(i = 0; i+4 < N; i+=4) { if (j >= E[i]) continue; array[k] += foo(i, tr[k][i], ex[j][i]); array[k] += foo(i+1, tr[k][i+1], ex[j][i+1]); array[k] += foo(i+2, tr[k][i+2], ex[j][i+2]); array[k] += foo(i+3, tr[k][i+3], ex[j][i+3]); } if (i < N) { for (; i < N; ++i) { if (j >= E[i]) continue; array[k] += foo(i, tr[k][i], ex[j][i]); } } } } I will be running this code in parallel using Intel's TBB so that it takes advantage of multiple cores. After this is finished running, another function prints out what is in array[] and right now, with my unrolling, the output isn't identical. Any help is appreciated. Thanks, Hristo

Read the article

Fastest inline-assembly spinlock

- by sigvardsen

I'm writing a multithreaded application in c++, where performance is critical. I need to use a lot of locking while copying small structures between threads, for this I have chosen to use spinlocks. I have done some research and speed testing on this and I found that most implementations are roughly equally fast: Microsofts CRITICAL_SECTION, with SpinCount set to 1000, scores about 140 time units Implementing this algorithm with Microsofts InterlockedCompareExchange scores about 95 time units Ive also tried to use some inline assembly with __asm {} using something like this code and it scores about 70 time units, but I am not sure that a proper memory barrier has been created. Edit: The times given here are the time it takes for 2 threads to lock and unlock the spinlock 1,000,000 times. I know this isn't a lot of difference but as a spinlock is a heavily used object, one would think that programmers would have agreed on the fastest possible way to make a spinlock. Googling it leads to many different approaches however. I would think this aforementioned method would be the fastest if implemented using inline assembly and using the instruction CMPXCHG8B instead of comparing 32bit registers. Furthermore memory barriers must be taken into account, this could be done by LOCK CMPXHG8B (I think?), which guarantees "exclusive rights" to the shared memory between cores. At last [some suggests] that for busy waits should be accompanied by NOP:REP that would enable Hyper-threading processors to switch to another thread, but I am not sure whether this is true or not? From my performance-test of different spinlocks, it is seen that there is not much difference, but for purely academic purpose I would like to know which one is fastest. However as I have extremely limited experience in the assembly-language and with memory barriers, I would be happy if someone could write the assembly code for the last example I provided with LOCK CMPXCHG8B and proper memory barriers in the following template: __asm { spin_lock: ;locking code. spin_unlock: ;unlocking code. }

Read the article

Time with and without OpenMP

- by was

I have a question.. I tried to improve a well known program algorithm in C, FOX algorithm for matrix multiplication.. relative link without openMP: (http://web.mst.edu/~ercal/387/MPI/ppmpi_c/chap07/fox.c). The initial program had only MPI and I tried to insert openMP in the matrix multiplication method, in order to improve the time of computation: (This program runs in a cluster and computers have 2 cores, thus I created 2 threads.) The problem is that there is no difference of time, with and without openMP. I observed that using openMP sometimes, time is equivalent or greater than the time without openMP. I tried to multiply two 600x600 matrices. void Local_matrix_multiply( LOCAL_MATRIX_T* local_A /* in */, LOCAL_MATRIX_T* local_B /* in */, LOCAL_MATRIX_T* local_C /* out */) { int i, j, k; chunk = CHUNKSIZE; // 100 #pragma omp parallel shared(local_A, local_B, local_C, chunk, nthreads) private(i,j,k,tid) num_threads(2) { /* tid = omp_get_thread_num(); if(tid == 0){ nthreads = omp_get_num_threads(); printf("O Pollaplasiamos pinakwn ksekina me %d threads\n", nthreads); } printf("Thread %d use the matrix: \n", tid); */ #pragma omp for schedule(static, chunk) for (i = 0; i < Order(local_A); i++) for (j = 0; j < Order(local_A); j++) for (k = 0; k < Order(local_B); k++) Entry(local_C,i,j) = Entry(local_C,i,j) + Entry(local_A,i,k)*Entry(local_B,k,j); } //end pragma omp parallel } /* Local_matrix_multiply */

Read the article

What limits scaling in this simple OpenMP program?

- by Douglas B. Staple

I'm trying to understand limits to parallelization on a 48-core system (4xAMD Opteron 6348, 2.8 Ghz, 12 cores per CPU). I wrote this tiny OpenMP code to test the speedup in what I thought would be the best possible situation (the task is embarrassingly parallel): // Compile with: gcc scaling.c -std=c99 -fopenmp -O3 #include <stdio.h> #include <stdint.h> int main(){ const uint64_t umin=1; const uint64_t umax=10000000000LL; double sum=0.; #pragma omp parallel for reduction(+:sum) for(uint64_t u=umin; u<umax; u++) sum+=1./u/u; printf("%e\n", sum); } I was surprised to find that the scaling is highly nonlinear. It takes about 2.9s for the code to run with 48 threads, 3.1s with 36 threads, 3.7s with 24 threads, 4.9s with 12 threads, and 57s for the code to run with 1 thread. Unfortunately I have to say that there is one process running on the computer using 100% of one core, so that might be affecting it. It's not my process, so I can't end it to test the difference, but somehow I doubt that's making the difference between a 19~20x speedup and the ideal 48x speedup. To make sure it wasn't an OpenMP issue, I ran two copies of the program at the same time with 24 threads each (one with umin=1, umax=5000000000, and the other with umin=5000000000, umax=10000000000). In that case both copies of the program finish after 2.9s, so it's exactly the same as running 48 threads with a single instance of the program. What's preventing linear scaling with this simple program?

Read the article

Is memory allocation in linux non-blocking?

- by Mark

I am curious to know if the allocating memory using a default new operator is a non-blocking operation. e.g. struct Node { int a,b; }; ... Node foo = new Node(); If multiple threads tried to create a new Node and if one of them was suspended by the OS in the middle of allocation, would it block other threads from making progress? The reason why I ask is because I had a concurrent data structure that created new nodes. I then modified the algorithm to recycle the nodes. The throughput performance of the two algorithms was virtually identical on a 24 core machine. However, I then created an interference program that ran on all the system cores in order to create as much OS pre-emption as possible. The throughput performance of the algorithm that created new nodes decreased by a factor of 5 relative the the algorithm that recycled nodes. I'm curious to know why this would occur. Thanks. *Edit : pointing me to the code for the c++ memory allocator for linux would be helpful as well. I tried looking before posting this question, but had trouble finding it.

Read the article

gcc compilations (sometimes) result in cpu underload

- by confusedCoder

I have a larger C++ program which starts out by reading thousands of small text files into memory and storing data in stl containers. This takes about a minute. Periodically, a compilation will exhibit behavior where that initial part of the program will run at about 22-23% CPU load. Once that step is over, it goes back to ~100% CPU. It is more likely to happen with O2 flag turned on but not consistently. It happens even less often with the -p flag which makes it almost impossible to profile. I did capture it once but the gprof output wasn't helpful - everything runs with the same relative speed just at low cpu usage. I am quite certain that this has nothing to do with multiple cores. I do have a quad-core cpu, and most of the code is multi-threaded, but I tested this issue running a single thread. Also, when I run the problematic step in multiple threads, each thread only runs at ~20% CPU. I apologize ahead of time for the vagueness of the question but I have run out of ideas as to how to troubleshoot it further, so any hints might be helpful. UPDATE: Just to make sure it's clear, the problematic part of the code does sometimes (~30-40% of the compilations) run at 100% CPU, so it's hard to buy the (otherwise reasonable) argument that I/O is the bottleneck

Read the article

what can cause large discrepancy between minor GC time and total pause time?

- by cxcg

We have a latency-sensitive application, and are experiencing some GC-related pauses we don't fully understand. We occasionally have a minor GC that results in application pause times that are much longer than the reported GC time itself. Here is an example log snippet: 485377.257: [GC 485378.857: [ParNew: 105845K-621K(118016K), 0.0028070 secs] 136492K-31374K(1035520K), 0.0028720 secs] [Times: user=0.01 sys=0.00, real=1.61 secs] Total time for which application threads were stopped: 1.6032830 seconds The total pause time here is orders of magnitude longer than the reported GC time. These are isolated and occasional events: the immediately preceding and succeeding minor GC events do not show this large discrepancy. The process is running on a dedicated machine, with lots of free memory, 8 cores, running Red Hat Enterprise Linux ES Release 4 Update 8 with kernel 2.6.9-89.0.1EL-smp. We have observed this with (32 bit) JVM versions 1.6.0_13 and 1.6.0_18. We are running with these flags: -server -ea -Xms512m -Xmx512m -XX:+UseConcMarkSweepGC -XX:NewSize=128m -XX:MaxNewSize=128m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:-TraceClassUnloading Can anybody offer some explanation as to what might be going on here, and/or some avenues for further investigation?

Read the article

How to debug JBoss out of memory problem?

- by user561733

Hello, I am trying to debug a JBoss out of memory problem. When JBoss starts up and runs for a while, it seems to use memory as intended by the startup configuration. However, it seems that when some unknown user action is taken (or the log file grows to a certain size) using the sole web application JBoss is serving up, memory increases dramatically and JBoss freezes. When JBoss freezes, it is difficult to kill the process or do anything because of low memory. When the process is finally killed via a -9 argument and the server is restarted, the log file is very small and only contains outputs from the startup of the newly started process and not any information on why the memory increased so much. This is why it is so hard to debug: server.log does not have information from the killed process. The log is set to grow to 2 GB and the log file for the new process is only about 300 Kb though it grows properly during normal memory circumstances. This is information on the JBoss configuration: JBoss (MX MicroKernel) 4.0.3 JDK 1.6.0 update 22 PermSize=512m MaxPermSize=512m Xms=1024m Xmx=6144m This is basic info on the system: Operating system: CentOS Linux 5.5 Kernel and CPU: Linux 2.6.18-194.26.1.el5 on x86_64 Processor information: Intel(R) Xeon(R) CPU E5420 @ 2.50GHz, 8 cores This is good example information on the system during normal pre-freeze conditions a few minutes after the jboss service startup: Running processes: 183 CPU load averages: 0.16 (1 min) 0.06 (5 mins) 0.09 (15 mins) CPU usage: 0% user, 0% kernel, 1% IO, 99% idle Real memory: 17.38 GB total, 2.46 GB used Virtual memory: 19.59 GB total, 0 bytes used Local disk space: 113.37 GB total, 11.89 GB used When JBoss freezes, system information looks like this: Running processes: 225 CPU load averages: 4.66 (1 min) 1.84 (5 mins) 0.93 (15 mins) CPU usage: 0% user, 12% kernel, 73% IO, 15% idle Real memory: 17.38 GB total, 17.18 GB used Virtual memory: 19.59 GB total, 706.29 MB used Local disk space: 113.37 GB total, 11.89 GB used

Read the article

Parallel programming, are we not learning from history again?

- by mezmo

I started programming because I was a hardware guy that got bored, I thought the problems being solved in the software side of things were much more interesting than those in hardware. At that time, most of the electrical buses I dealt with were serial, some moving data as fast as 1.5 megabit!! ;) Over the years these evolved into parallel buses in order to speed communication up, after all, transferring 8/16/32/64, whatever bits at a time incredibly speeds up the transfer. Well, our ability to create and detect state changes got faster and faster, to the point where we could push data so fast that interference between parallel traces or cable wires made cleaning the signal too expensive to continue, and we still got reasonable performance from serial interfaces, heck some graphics interfaces are even happening over USB for a while now. I think I'm seeing a like trend in software now, our processors were getting faster and faster, so we got good at building "serial" software. Now we've hit a speed bump in raw processor speed, so we're adding cores, or "traces" to the mix, and spending a lot of time and effort on learning how to properly use those. But I'm also seeing what I feel are advances in things like optical switching and even quantum computing that could take us far more quickly that I was expecting back to the point where "serial programming" again makes the most sense. What are your thoughts?

Read the article

Slowing process creation under Java?

- by oconnor0

I have a single, large heap (up to 240GB, though in the 20-40GB range for most of this phase of execution) JVM [1] running under Linux [2] on a server with 24 cores. We have tens of thousands of objects that have to be processed by an external executable & then load the data created by those executables back into the JVM. Each executable produces about half a megabyte of data (on disk) that when read right in, after the process finishes, is, of course, larger. Our first implementation was to have each executable handle only a single object. This involved the spawning of twice as many executables as we had objects (since we called a shell script that called the executable). Our CPU utilization would start off high, but not necessarily 100%, and slowly worsen. As we began measuring to see what was happening we noticed that the process creation time [3] continually slows. While starting at sub-second times it would eventually grow to take a minute or more. The actual processing done by the executable usually takes less than 10 seconds. Next we changed the executable to take a list of objects to process in an attempt to reduce the number of processes created. With batch sizes of a few hundred (~1% of our current sample size), the process creation times start out around 2 seconds & grow to around 5-6 seconds. Basically, why is it taking so long to create these processes as execution continues? [1] Oracle JDK 1.6.0_22 [2] Red Hat Enterprise Linux Advanced Platform 5.3, Linux kernel 2.6.18-194.26.1.el5 #1 SMP [3] Creation of the ProcessBuilder object, redirecting the error stream, and starting it.

Read the article

How to synchronize access to many objects

- by vividos

I have a thread pool with some threads (e.g. as many as number of cores) that work on many objects, say thousands of objects. Normally I would give each object a mutex to protect access to its internals, lock it when I'm doing work, then release it. When two threads would try to access the same object, one of the threads has to wait. Now I want to save some resources and be scalable, as there may be thousands of objects, and still only a hand full of threads. I'm thinking about a class design where the thread has some sort of mutex or lock object, and assigns the lock to the object when the object should be accessed. This would save resources, as I only have as much lock objects as I have threads. Now comes the programming part, where I want to transfer this design into code, but don't know quite where to start. I'm programming in C++ and want to use Boost classes where possible, but self written classes that handle these special requirements are ok. How would I implement this? My first idea was to have a boost::mutex object per thread, and each object has a boost::shared_ptr that initially is unset (or NULL). Now when I want to access the object, I lock it by creating a scoped_lock object and assign it to the shared_ptr. When the shared_ptr is already set, I wait on the present lock. This idea sounds like a heap full of race conditions, so I sort of abandoned it. Is there another way to accomplish this design? A completely different way?

Read the article

How to multi-thread this?

- by WilliamKF

I wish to have two threads. The first thread1 occasionally calls the following pseudo function: void waitForThread2() { if (thread2 is not idle) { return; } notifyThread2IamReady(); while (thread2IsExclusive) { } } The second thread2 is forever in the following pseudo loop: for (;;) { Notify thread1 I am idle. while (!thread1IsReady()) { } Notify thread1 I am exclusive. Do some work while thread1 is blocked. Notify thread1 I am busy. Do some work in parallel with thread1. } What is the best way to write this such that both thread1 and thread2 are kept as busy as possible on a machine with multiple cores. I would like to avoid long delays between notification in one thread and detection by the other. I tried using pthread condition variables but found the delay between thread2 doing 'notify thread1 I am busy' and the loop in waitForThread2() on thear2IsExclusive() can be up to almost one second delay. I then tried using a volatile sig_atomic_t shared variable to control the same, but something is going wrong, so I must not be doing it correctly.

Read the article

Trouble understanding the semantics of volatile in Java

- by HungryTux

I've been reading up about the use of volatile variables in Java. I understand that they ensure instant visibility of their latest updates to all the threads running in the system on different cores/processors. However no atomicity of the operations that caused these updates is ensured. I see the following literature being used frequently A write to a volatile field happens-before every read of that same field . This is where I am a little confused. Here's a snippet of code which should help me better explain my query. volatile int x = 0; volatile int y = 0; Thread-0: | Thread-1: | if (x==1) { | if (y==1) { return false; | return false; } else { | } else { y=1; | x=1; return true; | return true; } | } Since x & y are both volatile, we have the following happens-before edges between the write of y in Thread-0 and read of y in Thread-1 between the write of x in Thread-1 and read of x in Thread-0 Does this imply that, at any point of time, only one of the threads can be in its 'else' block(since a write would happen before the read)? It may well be possible that Thread-0 starts, loads x, finds it value as 0 and right before it is about to write y in the else-block, there's a context switch to Thread-1 which loads y finds it value as 0 and thus enters the else-block too. Does volatile guard against such context switches (seems very unlikely)?

Read the article

Parallel features in .Net 4.0

- by Jonathan.Peppers

I have been going over the practicality of some of the new parallel features in .Net 4.0. Say I have code like so: foreach (var item in myEnumerable) myDatabase.Insert(item.ConvertToDatabase()); Imagine myDatabase.Insert is performing some work to insert to a SQL database. Theoretically you could write: Parallel.ForEach(myEnumerable, item => myDatabase.Insert(item.ConvertToDatabase())); And automatically you get code that takes advantage of multiple cores. But what if myEnumerable can only be interacted with by a single thread? Will the Parallel class enumerate by a single thread and only dispatch the result to worker threads in the loop? What if myDatabase can only be interacted with by a single thread? It would certainly not be better to make a database connection per iteration of the loop. Finally, what if my "var item" happens to be a UserControl or something that must be interacted with on the UI thread? What design pattern should I follow to solve these problems? It's looking to me that switching over to Parallel/PLinq/etc is not exactly easy when you are dealing with real-world applications.

Search Results

Search found 854 results on 35 pages for 'cores'.

Page 29/35 | < Previous Page | 25 26 27 28 29 30 31 32 33 34 35 | Next Page >

- by user313209

- by user359708

- by zhang18

- by Thijs Cramer

- by WilliamKF

- by Piotr Kochanski

- by analogy

- by hoodoos

- by Jan

- by acidzombie24

- by acidzombie24

- by Hristo

- by sigvardsen

- by was

- by Douglas B. Staple

- by Mark

- by confusedCoder

- by cxcg

- by user561733

- by mezmo

- by oconnor0

- by vividos

- by WilliamKF

- by HungryTux

- by Jonathan.Peppers

< Previous Page | 25 26 27 28 29 30 31 32 33 34 35 | Next Page >