Search Results

Search found 1774 results on 71 pages for 'parallel'.

Page 20/71 | < Previous Page | 16 17 18 19 20 21 22 23 24 25 26 27 | Next Page >

MPI Large Data all to all transfer

- by csslayer

My application of MPI has some process that generate some large data. Say we have N+1 process (one for master control, others are workers), each of worker processes generate large data, which is now simply write to normal file, named file1, file2, ..., fileN. The size of each file may be quite different. Now I need to send all fileM to rank M process to do the next job, So it's just like all to all data transfer. My problem is how should I use MPI API to send these files efficiently? I used to use windows share folder to transfer these before, but I think it's not a good idea. I have think about MPI_file and MPI_All_to_all, but these functions seems not to be so suitable for my case. Simple MPI_Send and MPI_Recv seems hard to be used because every process need to transfer large data, and I don't want to use distributed file system for now.

Read the article
help me understand cuda

- by scatman

i am having some troubles understanding threads in NVIDIA gpu architecture with cuda. please could anybody clarify these info: an 8800 gpu has 16 SMs with 8 SPs each. so we have 128 SPs. i was viewing stanford's video presentation and it was saying that every SP is capable of running 96 threads cuncurrently. does this mean that it (SP) can run 96/32=3 warps concurrently? moreover, since every SP can run 96 threads and we have 8 SPs in every SM. does this mean that every SM can run 96*8=768 threads concurrently?? but if every SM can run a single Block at a time, and the maximum number of threads in a block is 512, so what is the purpose of running 768 threads concurrently and have a max of 512 threads? a more general question is:how are blocks,threads,and warps distributed to SMs and SPs? i read that every SM gets a single block to execute at a time and threads in a block is divided into warps (32 threads), and SPs execute warps.

Read the article
add uchar values in ushort array with sse2 or sse3

- by pompolus

i have an unsigned short dst[16][16] matrix and a larger unsigned char src[m][n] matrix. Now i have to access in the src matrix and add a 16x16 submatrix to dst, using sse2 or ss3. In a my older implementation, I was sure that my summed values ??were never greater than 256, so i could do this: for (int row = 0; row < 16; ++row) { __m128i subMat = _mm_lddqu_si128(reinterpret_cast<const __m128i*>(src)); dst[row] = _mm_add_epi8(dst[row], subMat); src += W; // Step to next row i need to add } where W is an offset to reach the desired rows. This code works, but now my values in src are larger and summed could be greater than 256, so i need to store them as ushort. i've tried this: for (int row = 0; row < 16; ++row) { __m128i subMat = _mm_lddqu_si128(reinterpret_cast<const __m128i*>(src)); dst[row] = _mm_add_epi16(dst[row], subMat); src += W; // Step to next row i need to add } but it doesn't work. I'm not so good with sse, so any help will be appreciated.

Read the article
How can i connect two or more machines via tcp cable to form a network grid?

- by Gath

How can i connect two or more machines to form a network grid and how can i distribute work load to the two machines? What operating systems do i need to run on the machines, and what application should i use to manage the load balancing? NB: I read somewhere that google uses cheap machines to perform this fete, how do they connect two network cards( 'Teaming' ) and distribute load across the machines? Good practical examples would serve me good, with actual code samples. Pointers to some good site i might read this stuff will be highly appreciated.

Read the article
Why are Asynchronous processes not called Synchronous?

- by Balk

So I'm a little confused by this terminology. Everyone refers to "Asynchronous" computing as running different processes on seperate threads, which gives the illusion that these processes are running at the same time. This is not the definition of the word asynchronous. a·syn·chro·nous –adjective 1. not occurring at the same time. 2. (of a computer or other electrical machine) having each operation started only after the preceding operation is completed. What am I not understanding here?

Read the article
Executing functions parallelly in PHP

- by binaryLV

Hi! Can PHP call a function and don't wait for it to return? So something like this: function callback($pause, $arg) { sleep($pause); echo $arg, "\n"; } header('Content-Type: text/plain'); fast_call_user_func_array('callback', array(3, 'three')); fast_call_user_func_array('callback', array(2, 'two')); fast_call_user_func_array('callback', array(1, 'one')); would output one (after 1 second) two (after 2 seconds) three (after 3 seconds) rather than three (after 3 seconds) two (after 3 + 2 = 5 seconds) one (after 3 + 2 + 1 = 6 seconds) Main script is intended to be run as a permanent process (TCP server). callback() function would receive data from client, execute external PHP script and then do something based on other arguments that are passed to callback(). The problem is that main script must not wait for external PHP script to finish. Result of external script is important, so exec('php -f file.php &') is not an option.

Read the article
Setting the cores to use in Parallelism

- by Ben

Hi, I have a feeling the answer to this is no, but using .Net 4.0's Parallelism, can you set the amount of cores on which to run i.e. if your running a Quad Core, can you set your Application to only use 2 of them? Thanks

Read the article
MPI Odd/Even Compare-Split Deadlock

- by erebel55

I'm trying to write an MPI version of a program that runs an odd/even compare-split operation on n randomly generated elements. Process 0 should generated the elements and send nlocal of them to the other processes, (keeping the first nlocal for itself). From here, process 0 should print out it's results after running the CompareSplit algorithm. Then, receive the results from the other processes run of the algorithm. Finally, print out the results that it has just received. I have a large chunk of this already done, but I'm getting a deadlock that I can't seem to fix. I would greatly appreciate any hints that people could give me. Here is my code http://pastie.org/3742474 Right now I'm pretty sure that the deadlock is coming from the Send/Recv at lines 134 and 151. I've tried changing the Send to use "tag" instead of myrank for the tag parameter..but when I did that I just keep getting a "MPI_ERR_TAG: invalid tag" for some reason. Obviously I would also run the algorithm within the processors 0 but I took that part out for now, until I figure out what is going wrong. Any help is appreciated.

Read the article
Can a Perl subroutine return data but keep processing?

- by Perl QuestionAsker

Is there any way to have a subroutine send data back while still processing? For instance (this example used simply to illustrate) - a subroutine reads a file. While it is reading through the file, if some condition is met, then "return" that line and keep processing. I know there are those that will answer - why would you want to do that? and why don't you just ...?, but I really would like to know if this is possible. Thank you so much in advance.

Read the article
Simple C++ container class that is thread-safe for writing

- by conradlee

I am writing a multi-threaded program using OpenMP in C++. At one point my program forks into many threads, each of which need to add "jobs" to some container that keeps track of all added jobs. Each job can just be a pointer to some object. Basically, I just need the add pointers to some container from several threads at the same time. Is there a simple solution that performs well? After some googling, I found that STL containers are not thread-safe. Some stackoverflow threads address this question, but none that forms a consensus on a simple solution.

Read the article
Can I use MPI_Probe to probe messsages sent by any collective operation?

- by takwing

In my code I have a server process repeatedly probing for incoming messages, which come in two types. One type of the two will be sent once by each process to give hint to the server process about its termination. I was wondering if it is valid to use MPI_Broadcast to broadcast these termination messages and use MPI_Probe to probe their arrivals. I tried using this combination but it failed. This failure might have been caused by some other things. So I would like anyone who knows about this to confirm. Cheers.

Read the article
Take advantage of multiple cores executing SQL statements

- by willvv

I have a small application that reads XML files and inserts the information on a SQL DB. There are ~ 300 000 files to import, each one with ~ 1000 records. I started the application on 20% of the files and it has been running for 18 hours now, I hope I can improve this time for the rest of the files. I'm not using a multi-thread approach, but since the computer I'm running the process on has 4 cores I was thinking on doing it to get some improvement on the performance (although I guess the main problem is the I/O and not only the processing). I was thinking on using the BeginExecutingNonQuery() method on the SqlCommand object I create for each insertion, but I don't know if I should limit the max amount of simultaneous threads (nor I know how to do it). What's your advice to get the best CPU utilization? Thanks

Read the article
how to do thread in matlab?

- by hai

how to do thread in matlab? i want to run one function on two variables simulatniosly how to do it?

Read the article
how to cancel an awaiting task c#

- by user1748906

I am trying to cancel a task awaiting for network IO using CancellationTokenSource, but I have to wait until TcpClient connects: try { while (true) { token.Token.ThrowIfCancellationRequested(); Thread.Sleep(int.MaxValue); //simulating a TcpListener waiting for request } } any ideas ? Secondly, is it OK to start each client in a separate task ?

Read the article
[C++] Needed: A simple C++ container (stack, linked list) that is thread-safe for writing

- by conradlee

I am writing a multi-threaded program using OpenMP in C++. At one point my program forks into many threads, each of which need to add "jobs" to some container that keeps track of all added jobs. Each job can just be a pointer to some object. Basically, I just need the add pointers to some container from several threads at the same time. Is there a simple solution that performs well? After some googling, I found that STL containers are not thread-safe. Some stackoverflow threads address this question, but none form a consensus on a simple solution.

Read the article
Parallelism in .NET – Part 5, Partitioning of Work

- by Reed

When parallelizing any routine, we start by decomposing the problem. Once the problem is understood, we need to break our work into separate tasks, so each task can be run on a different processing element. This process is called partitioning. Partitioning our tasks is a challenging feat. There are opposing forces at work here: too many partitions adds overhead, too few partitions leaves processors idle. Trying to work the perfect balance between the two extremes is the goal for which we should aim. Luckily, the Task Parallel Library automatically handles much of this process. However, there are situations where the default partitioning may not be appropriate, and knowledge of our routines may allow us to guide the framework to making better decisions. First off, I’d like to say that this is a more advanced topic. It is perfectly acceptable to use the parallel constructs in the framework without considering the partitioning taking place. The default behavior in the Task Parallel Library is very well-behaved, even for unusual work loads, and should rarely be adjusted. I have found few situations where the default partitioning behavior in the TPL is not as good or better than my own hand-written partitioning routines, and recommend using the defaults unless there is a strong, measured, and profiled reason to avoid using them. However, understanding partitioning, and how the TPL partitions your data, helps in understanding the proper usage of the TPL. I indirectly mentioned partitioning while discussing aggregation. Typically, our systems will have a limited number of Processing Elements (PE), which is the terminology used for hardware capable of processing a stream of instructions. For example, in a standard Intel i7 system, there are four processor cores, each of which has two potential hardware threads due to Hyperthreading. This gives us a total of 8 PEs – theoretically, we can have up to eight operations occurring concurrently within our system. In order to fully exploit this power, we need to partition our work into Tasks. A task is a simple set of instructions that can be run on a PE. Ideally, we want to have at least one task per PE in the system, since fewer tasks means that some of our processing power will be sitting idle. A naive implementation would be to just take our data, and partition it with one element in our collection being treated as one task. When we loop through our collection in parallel, using this approach, we’d just process one item at a time, then reuse that thread to process the next, etc. There’s a flaw in this approach, however. It will tend to be slower than necessary, often slower than processing the data serially. The problem is that there is overhead associated with each task. When we take a simple foreach loop body and implement it using the TPL, we add overhead. First, we change the body from a simple statement to a delegate, which must be invoked. In order to invoke the delegate on a separate thread, the delegate gets added to the ThreadPool’s current work queue, and the ThreadPool must pull this off the queue, assign it to a free thread, then execute it. If our collection had one million elements, the overhead of trying to spawn one million tasks would destroy our performance. The answer, here, is to partition our collection into groups, and have each group of elements treated as a single task. By adding a partitioning step, we can break our total work into small enough tasks to keep our processors busy, but large enough tasks to avoid overburdening the ThreadPool. There are two clear, opposing goals here: Always try to keep each processor working, but also try to keep the individual partitions as large as possible. When using Parallel.For, the partitioning is always handled automatically. At first, partitioning here seems simple. A naive implementation would merely split the total element count up by the number of PEs in the system, and assign a chunk of data to each processor. Many hand-written partitioning schemes work in this exactly manner. This perfectly balanced, static partitioning scheme works very well if the amount of work is constant for each element. However, this is rarely the case. Often, the length of time required to process an element grows as we progress through the collection, especially if we’re doing numerical computations. In this case, the first PEs will finish early, and sit idle waiting on the last chunks to finish. Sometimes, work can decrease as we progress, since previous computations may be used to speed up later computations. In this situation, the first chunks will be working far longer than the last chunks. In order to balance the workload, many implementations create many small chunks, and reuse threads. This adds overhead, but does provide better load balancing, which in turn improves performance. The Task Parallel Library handles this more elaborately. Chunks are determined at runtime, and start small. They grow slowly over time, getting larger and larger. This tends to lead to a near optimum load balancing, even in odd cases such as increasing or decreasing workloads. Parallel.ForEach is a bit more complicated, however. When working with a generic IEnumerable<T>, the number of items required for processing is not known in advance, and must be discovered at runtime. In addition, since we don’t have direct access to each element, the scheduler must enumerate the collection to process it. Since IEnumerable<T> is not thread safe, it must lock on elements as it enumerates, create temporary collections for each chunk to process, and schedule this out. By default, it uses a partitioning method similar to the one described above. We can see this directly by looking at the Visual Partitioning sample shipped by the Task Parallel Library team, and available as part of the Samples for Parallel Programming. When we run the sample, with four cores and the default, Load Balancing partitioning scheme, we see this: The colored bands represent each processing core. You can see that, when we started (at the top), we begin with very small bands of color. As the routine progresses through the Parallel.ForEach, the chunks get larger and larger (seen by larger and larger stripes). Most of the time, this is fantastic behavior, and most likely will out perform any custom written partitioning. However, if your routine is not scaling well, it may be due to a failure in the default partitioning to handle your specific case. With prior knowledge about your work, it may be possible to partition data more meaningfully than the default Partitioner. There is the option to use an overload of Parallel.ForEach which takes a Partitioner<T> instance. The Partitioner<T> class is an abstract class which allows for both static and dynamic partitioning. By overriding Partitioner<T>.SupportsDynamicPartitions, you can specify whether a dynamic approach is available. If not, your custom Partitioner<T> subclass would override GetPartitions(int), which returns a list of IEnumerator<T> instances. These are then used by the Parallel class to split work up amongst processors. When dynamic partitioning is available, GetDynamicPartitions() is used, which returns an IEnumerable<T> for each partition. If you do decide to implement your own Partitioner<T>, keep in mind the goals and tradeoffs of different partitioning strategies, and design appropriately. The Samples for Parallel Programming project includes a ChunkPartitioner class in the ParallelExtensionsExtras project. This provides example code for implementing your own, custom allocation strategies, including a static allocator of a given chunk size. Although implementing your own Partitioner<T> is possible, as I mentioned above, this is rarely required or useful in practice. The default behavior of the TPL is very good, often better than any hand written partitioning strategy.

Read the article
iPhone Mobile Safari, How many max parallel http connections?

- by user316994

I would like to use parallel AJAX HTTP requests with iPhone Mobile Safari (OS4). What is the max number of parallel connections?

Read the article
Is there a way to control how pytest-xdist runs tests in parallel?

- by superselector

I have the following directory layout: runner.py lib/ tests/ testsuite1/ testsuite1.py testsuite2/ testsuite2.py testsuite3/ testsuite3.py testsuite4/ testsuite4.py The format of testsuite*.py modules is as follows: import pytest class testsomething: def setup_class(self): ''' do some setup ''' # Do some setup stuff here def teardown_class(self): '''' do some teardown''' # Do some teardown stuff here def test1(self): # Do some test1 related stuff def test2(self): # Do some test2 related stuff .... .... .... def test40(self): # Do some test40 related stuff if __name__=='__main()__' pytest.main(args=[os.path.abspath(__file__)]) The problem I have is that I would like to execute the 'testsuites' in parallel i.e. I want testsuite1, testsuite2, testsuite3 and testsuite4 to start execution in parallel but individual tests within the testsuites need to be executed serially. When I use the 'xdist' plugin from py.test and kick off the tests using 'py.test -n 4', py.test is gathering all the tests and randomly load balancing the tests among 4 workers. This leads to the 'setup_class' method to be executed every time of each test within a 'testsuitex.py' module (which defeats my purpose. I want setup_class to be executed only once per class and tests executed serially there after). Essentially what I want the execution to look like is: worker1: executes all tests in testsuite1.py serially worker2: executes all tests in testsuite2.py serially worker3: executes all tests in testsuite3.py serially worker4: executes all tests in testsuite4.py serially while worker1, worker2, worker3 and worker4 are all executed in parallel. Is there a way to achieve this in 'pytest-xidst' framework? The only option that I can think of is to kick off different processes to execute each test suite individually within runner.py: def test_execute_func(testsuite_path): subprocess.process('py.test %s' % testsuite_path) if __name__=='__main__': #Gather all the testsuite names for each testsuite: multiprocessing.Process(test_execute_func,(testsuite_path,))

Read the article
How can I test if a point lies between two parallel lines?

- by Harold

In the game I'm designing there is a blast that shoots out from an origin point towards the direction of the mouse. The width of this blast is always going to be the same. Along the bottom of the screen (what's currently) squares move about which should be effected by the blast that the player controls. Currently I am trying to work out a way to discover if the corners of these squares are within the blast's two bounding lines. I thought the best way to do this would be to rotate the corners of the square around an origin point as if the blast were completely horizontal and see if the Y values of the corners were less than or equal to the width of the blast which would mean that they lie within the effected region, but I can't work out

Read the article
How to get scripted programs governing game entities run in parallel with a game loop?

- by Jim

I recently discovered Crobot which is (briefly) a game where each player codes a virtual robot in a pseudo-C language. Each robot is then put in an arena where it fights against other robots. A robots' source code has this shape : /* Beginning file robot.r */ main() { while (1) { /* Do whatever you want */ ... move(); ... fire(); } } /* End file robot.r */ You can see that : The code is totally independent from any library/include Some predefined functions are available (move, fire, etc…) The program has its own game loop, and consequently is not called every frame My question is: How to achieve a similar result using scripted languages in collaboration with a C/C++ main program ? I found a possible approach using Python, multi-threading and shared memory, although I am not sure yet that it is possible this way. TCP/IP seems a bit too complicated for this kind of application.

Read the article
How to find optimal path visit every node with parallel workers complicated by dynamic edge costs?

- by Aaron Anodide

Say you have an acyclic directed graph with weighted edges and create N workers. My goal is to calculate the optimal way those workers can traverse the entire graph in parralel. However, edge costs may change along the way. Example: A -1-> B A -2-> C B -3-> C (if A has already been visited) B -5-> C (if A has not already been visited) Does what I describe lend itself to a standard algorithmic approach, or alternately can someone suggest if I'm looking at this in an inherently flawed way (i have an intuition I might be)?

Read the article
How to prioritize tasks when you have multiple programming projects running in parallel?

- by Vinko Vrsalovic

Say you have 5 customers, you develop 2 or 3 different projects for each. Each project has Xi tasks. Each project takes from 2 to 10 man weeks. Given that there are few resources, it is desired to minimize the management overhead. Two questions in this scenario: What tools would you use to prioritize the tasks and track their completion, while tending to minimize the overhead? What criteria would you take into consideration to determine which task to assign to the next available resource given that the primary objective is to increase throughput (more projects finished per time unit, this objective conflicts with starting one project and finishing it and then moving on to the next)? Ideas, management techniques, algorithms are welcome

Read the article
Why is lowlatency kernel not being updated in parallel with the generic kernel?

- by FlabbergastedPickle

All, Any idea when we'll see updates to the lowlatency version of the Ubuntu 12.04 kernel? It is still stuck at 3.2.0.23 whereas the generic kernel is already several updates ahead of it at a version 3.2.0.25? NB: I am using a 64-bit version but I don't think this is limited to the 64-bit kernels alone but rather affects both 32-bit and 64-bit builds. Please do correct me if I am wrong about this.

Read the article
Shows how to use the new Tasks namespace to download multiple documents in parallel.

In C# 4.0, Task parallelism is the lowest-level approach to parallelization with PFX. The classes for working at this level are defined in the System.Threading.Tasks namespace. read moreBy Peter BrombergDid you know that DotNetSlackers also publishes .net articles written by top known .net Authors? We already have over 80 articles in several categories including Silverlight. Take a look: here.

Read the article
Virtual PC, XP mode. Bidirectional parallel port physical LPT1 not available in VPC XP mode?

- by Tom Merklin

I need my LPT1 port to be available in XP mode. In the "guest" window, it shows up in system hardware, but it is not available to the application. If I look into Tools, settings there are no parallel port available? How can I fix this problem? (If I use VMWare Workstation XP mode, I load a driver and it works perfect).

Read the article

< Previous Page | 16 17 18 19 20 21 22 23 24 25 26 27 | Next Page >