Search Results

Search found 55 results on 3 pages for 'parallelization'.

Page 1/3 | 1 2 3 | Next Page >

Automatic Parallelization

Multithreading an application to improve performance can be a time-consuming activity Business - Accounting - Time Tracking - Business Services - Windows

Read the article
Bash Parallelization of CPU-intensive processes

- by ehsanul

tee forwards its stdin to every single file specified, while pee does the same, but for pipes. These programs send every single line of their stdin to each and every file/pipe specified. However, I was looking for a way to "load balance" the stdin to different pipes, so one line is sent to the first pipe, another line to the second, etc. It would also be nice if the stdout of the pipes are collected into one stream as well. The use case is simple parallelization of CPU intensive processes that work on a line-by-line basis. I was doing a sed on a 14GB file, and it could have run much faster if I could use multiple sed processes. The command was like this: pv infile | sed 's/something//' > outfile To parallelize, the best would be if GNU parallel would support this functionality like so (made up the --demux-stdin option): pv infile | parallel -u -j4 --demux-stdin "sed 's/something//'" > outfile However, there's no option like this and parallel always uses its stdin as arguments for the command it invokes, like xargs. So I tried this, but it's hopelessly slow, and it's clear why: pv infile | parallel -u -j4 "echo {} | sed 's/something//'" > outfile I just wanted to know if there's any other way to do this (short of coding it up myself). If there was a "load-balancing" tee (let's call it lee), I could do this: pv infile | lee >(sed 's/something//' >> outfile) >(sed 's/something//' >> outfile) >(sed 's/something//' >> outfile) >(sed 's/something//' >> outfile) Not pretty, so I'd definitely prefer something like the made up parallel version, but this would work too.

Read the article
Windows Azure: Parallelization of the code

- by veda

I have some matrix multiplication operation. I want to parallelize the execution of those operations through multiple processors.. This can be done on high performance computing cluster using MPI (Message Passing Interface). Like wise, can I do some parallelization in the cloud using multiple worker roles. Is there any means for doing that.

Read the article
Do you know any build systems with decent support for parallelization?

- by dahpgjgamgan

Hi, I am looking for a build system (working on ms windows) that has good support for parallelization of tasks/targets (or whatever you call them). To be more specific - during build (that is initiated on MS Windows machine) I need to copy source files to a number of different machines (which are not necessarily running Windows) and start a remote job on each of them - and I really like to do that on all machines at once. Does anyone know a build system that's capable of executing such a task in parallel. From what I googled, the options currently available are: -j switch in make - but i don't know if nmake supports this -some custom nAnt tasks -msbuild has some form of support for parallelization - seems similiar to make (meaning you don't specify what to do in parallel, just specify that it would be nice to build things that way) -fake (f# make) is written in functional programming language which are known to have good parallelization support - but I'm not very skillful in functional programming area. Any other solutions I could explore?

Read the article
Parallelize or vectorize all-against-all operation on a large number of matrices?

- by reve_etrange

I have approximately 5,000 matrices with the same number of rows and varying numbers of columns (20 x ~200). Each of these matrices must be compared against every other in a dynamic programming algorithm. In this question, I asked how to perform the comparison quickly and was given an excellent answer involving a 2D convolution. Serially, iteratively applying that method, like so list = who('data_matrix_prefix*') H = cell(numel(list),numel(list)); for i=1:numel(list) for j=1:numel(list) if i ~= j eval([ 'H{i,j} = compare(' char(list(i)) ',' char(list(j)) ');']); end end end is fast for small subsets of the data (e.g. for 9 matrices, 9*9 - 9 = 72 calls are made in ~1 s). However, operating on all the data requires almost 25 million calls. I have also tried using deal() to make a cell array composed entirely of the next element in data, so I could use cellfun() in a single loop: # who(), load() and struct2cell() calls place k data matrices in a 1D cell array called data. nextData = cell(k,1); for i=1:k [nextData{:}] = deal(data{i}); H{:,i} = cellfun(@compare,data,nextData,'UniformOutput',false); end Unfortunately, this is not really any faster, because all the time is in compare(). Both of these code examples seem ill-suited for parallelization. I'm having trouble figuring out how to make my variables sliced. compare() is totally vectorized; it uses matrix multiplication and conv2() exclusively (I am under the impression that all of these operations, including the cellfun(), should be multithreaded in MATLAB?). Does anyone see a (explicitly) parallelized solution or better vectorization of the problem?

Read the article
how to implement intel's tbb::blocked_range2d in C++

- by Hristo

I'm trying to parallelize nested for loops with parellel_for() and the blocked_range2d from Intel's TBB using C++. The for loops look like this: for(int i = 0; i < N; ++i) { for(int j = 0; j < E[i]; ++j) { for(int k = 0; k < T; ++k) { score[k] += delta(i, trRating[k][i], exRating[j][i]); } } } ... and I am trying to do the following: class LoopBody { private: int *myscore; public: LoopBody(int *score) { myscore = score; } void operator()(const blocked_range2d<int> &r) const { for(int i = r.rows().begin(); i != r.rows().end(); ++i); for(int j = 0; j < E[i]; ++j) { for(int k = r.cols().begin(); k != r.cols().end(); ++k) { myscore[k] += foo(...); // uses i,j,k to look up indices in arrays } } } } }; void computeScores(int score[]) { parallel_for(blocked_range2d<int>(0, N, 0, T), LoopBody(score)); } ... but I am getting the following compile errors: error: identifier "i" is undefined for(int j = 0; j < E[i]; ++j) { ^ error: expected an identifier }; ^ I'm not really sure if I am doing this the right way, but any advice is appreciated. Also, this is my first time using Intel's TBB so I really don't know anything about it. Any ideas how make this work? Thanks, Hristo

Read the article
Improving the Performance of the Secure Hash Algorithm (SHA-1)

Parallelization can make a difference Algorithm - Math - Cryptography - Communication Theory - Parallel computing

Read the article
Shows how to use the new Tasks namespace to download multiple documents in parallel.

In C# 4.0, Task parallelism is the lowest-level approach to parallelization with PFX. The classes for working at this level are defined in the System.Threading.Tasks namespace. read moreBy Peter BrombergDid you know that DotNetSlackers also publishes .net articles written by top known .net Authors? We already have over 80 articles in several categories including Silverlight. Take a look: here.

Read the article
Real World Java EE Patterns by Adam Bien

- by JuergenKress

Rethinking Best Practices, A book about rethinking patterns, best practices, idioms and Java EE Real World Java EE Patterns - Rethinking Best Practices discusses patterns and best practices in a structured way, with code from real world projects. This book covers: an introduction into the core principles and APIs of Java EE 6, principles of transactions, isolation levels, CAP and BASE, remoting, pragmatic modularization and structure of Java EE applications, discussion of superfluous patterns and outdated best practices, patterns for domain driven and service oriented components, custom scopes, asynchronous processing and parallelization, real time HTTP events, schedulers, REST optimizations, plugins and monitoring tools, and fully functional JCA 1.6 implementation. Real World Java EE Night Hacks - Dissecting the Business Tier will not only help experienced developers and architects to write concise code, but especially help you to shrink the codebase to unbelievably small sizes :-). Order here. WebLogic Partner Community For regular information become a member in the WebLogic Partner Community please visit: http://www.oracle.com/partners/goto/wls-emea ( OPN account required). If you need support with your account please contact the Oracle Partner Business Center. BlogTwitterLinkedInMixForumWiki Technorati Tags: Adam Bien,Real World Java,Java,Java EE,WebLogic Community,Oracle,OPN,Jürgen Kress

Read the article
DB DOC Enhancements for Oracle SQL Developer v4

- by thatjeffsmith

One of our more popular features is ‘DB Doc.’ It’s like JAVADOC for the database. Pick a connection, right-click, and go. It will generate an HTML documentation set for that schema. For version 4, we’ve introduced a few enhancements based on user requests. That’s right, you asked, and we listened. Added support for Package Bodies Added parallelization option for larger doc sets Enhanced the HTML formatting a bit Select Your Object Types and Generation Options We’ve changed the default selection of object types to be included and added support for package bodies There’s also an option to auto-open the documentation set after it’s been generated. And the HTML As Requested

Read the article
What are the challenges and benefits of writing games with a functional language?

- by McMuttons

While I know that functional languages aren't the most commonly used for game writing, there are a lot of benefits associate with them that seem like they would be interesting in any programming context. Especially the ease of parallelization I would think could be very useful as focus is moving toward more and more processors. Also, with F# as a new member of the .NET family, it can be used directly with XNA, for example, which lowers the threshold quite a bit, as opposed to going with LISP, Haskell, Erlang, etc. If anyone has experience writing games with functional code, what has turned out to be the positives and negatives? What was it suited for, what not? Edit: Finding it hard to decide that there's a single good answer for this, so it's probably better suited as a community wiki post.

Read the article
What programming language matches this description? [on hold]

- by Benubird

I am looking for a functional language that is basically dynamic programming - i.e. one where functions are first-class objects - but where all function calls are asynchronous by default; i.e. you define function X(a,b) = (Y(a)+Z(b)), and when X() is called, it sees it is waiting for the return from two functions, runs one in the current thread, and spawns a new thread to run the other. The future is very much parallel processing; multiple cores, multiple machines, the internet of things, etc. and I was wondering if there was a language specifically designed to make this kind of parallelization easy. I currently have only used imperative languages (c, php, java, ruby, etc), so I don't know anything about what kind of functional languages are available.

Read the article
Next programming paradigm for CBE/GPU in the next years

- by Werner

Hi, in the last five years, there has been a rise in the use of GPU and CBE for parallelization of applications. Around 2005-2007 verything seemed to be programmed by hand, C, etc. Afterwards new unifying alternatives emerged like CUDA for GPU and lastly OpenCL. What do you think will be the programming paradigm for GPU/CBE in the forthcoming years? My vote goes for OpenCL Thanks

Read the article
Where can I find a C# implementation of SHA-1 not using the built in libraries?

- by Peter

I am trying to rewrite the sha-1 algorithm for parallelization in a personal project of mine. Just wondering if anyone knows where I can find a C# implementation of the actual mathematical operations to perform the hash, not just the System.Security.Cryptography function.

Read the article
Parralelization in Microsoft SQL Server 2008 R2

- by stan31337

We have a specific accounting and production software, called 1C, which uses single user connection to the MS SQL 2008 R2 database. And there are about 500 users connecting to 1C server to perform their tasks. 1C and SQL 2008 are on separate servers. How to configure MS SQL 2008 R2 to effectively use parallelization in this configuration? We have 24 cores, and only one is loaded at 100% at MS SQL 2008 R2 server. We have already configured MS SQL max parallelizm from this MSDN article: http://msdn.microsoft.com/en-us/library/ms181007(v=sql.105).aspx Thank you!

Read the article
How much processor speed and cores do I need for these tasks?

- by ajay

I am planning to buy a new laptop as I find my current one very slow. My question here is specifically related to RAM size and CPU power. I will mostly be doing development (not much games). I would be dabbling in distributed computing, multithreaded and data intensive parallelizable tasks on multi-cores. For e.g. I would want to be able to Concurrent programming in Scala/Java/Clojure etc. and be able to see parallelization. Furthermore, I would want the RAM to be enough. But from a developer machine standpoint, do you think 4GB RAM and 2.53GHz Dual Core processor would be enough. I'm basically looking at this model: http://store.apple.com/us/configure/MC118LL/A?mco=MTM3NDcyODk (link dead)

Read the article
Parallelism in .NET – Part 7, Some Differences between PLINQ and LINQ to Objects

- by Reed

In my previous post on Declarative Data Parallelism, I mentioned that PLINQ extends LINQ to Objects to support parallel operations. Although nearly all of the same operations are supported, there are some differences between PLINQ and LINQ to Objects. By introducing Parallelism to our declarative model, we add some extra complexity. This, in turn, adds some extra requirements that must be addressed. In order to illustrate the main differences, and why they exist, let’s begin by discussing some differences in how the two technologies operate, and look at the underlying types involved in LINQ to Objects and PLINQ . LINQ to Objects is mainly built upon a single class: Enumerable. The Enumerable class is a static class that defines a large set of extension methods, nearly all of which work upon an IEnumerable<T>. Many of these methods return a new IEnumerable<T>, allowing the methods to be chained together into a fluent style interface. This is what allows us to write statements that chain together, and lead to the nice declarative programming model of LINQ: double min = collection .Where(item => item.SomeProperty > 6 && item.SomeProperty < 24) .Min(item => item.PerformComputation()); .csharpcode, .csharpcode pre { font-size: small; color: black; font-family: consolas, "Courier New", courier, monospace; background-color: #ffffff; /*white-space: pre;*/ } .csharpcode pre { margin: 0em; } .csharpcode .rem { color: #008000; } .csharpcode .kwrd { color: #0000ff; } .csharpcode .str { color: #006080; } .csharpcode .op { color: #0000c0; } .csharpcode .preproc { color: #cc6633; } .csharpcode .asp { background-color: #ffff00; } .csharpcode .html { color: #800000; } .csharpcode .attr { color: #ff0000; } .csharpcode .alt { background-color: #f4f4f4; width: 100%; margin: 0em; } .csharpcode .lnum { color: #606060; } Other LINQ variants work in a similar fashion. For example, most data-oriented LINQ providers are built upon an implementation of IQueryable<T>, which allows the database provider to turn a LINQ statement into an underlying SQL query, to be performed directly on the remote database. PLINQ is similar, but instead of being built upon the Enumerable class, most of PLINQ is built upon a new static class: ParallelEnumerable. When using PLINQ, you typically begin with any collection which implements IEnumerable<T>, and convert it to a new type using an extension method defined on ParallelEnumerable: AsParallel(). This method takes any IEnumerable<T>, and converts it into a ParallelQuery<T>, the core class for PLINQ. There is a similar ParallelQuery class for working with non-generic IEnumerable implementations. This brings us to our first subtle, but important difference between PLINQ and LINQ – PLINQ always works upon specific types, which must be explicitly created. Typically, the type you’ll use with PLINQ is ParallelQuery<T>, but it can sometimes be a ParallelQuery or an OrderedParallelQuery<T>. Instead of dealing with an interface, implemented by an unknown class, we’re dealing with a specific class type. This works seamlessly from a usage standpoint – ParallelQuery<T> implements IEnumerable<T>, so you can always “switch back” to an IEnumerable<T>. The difference only arises at the beginning of our parallelization. When we’re using LINQ, and we want to process a normal collection via PLINQ, we need to explicitly convert the collection into a ParallelQuery<T> by calling AsParallel(). There is an important consideration here – AsParallel() does not need to be called on your specific collection, but rather any IEnumerable<T>. This allows you to place it anywhere in the chain of methods involved in a LINQ statement, not just at the beginning. This can be useful if you have an operation which will not parallelize well or is not thread safe. For example, the following is perfectly valid, and similar to our previous examples: double min = collection .AsParallel() .Select(item => item.SomeOperation()) .Where(item => item.SomeProperty > 6 && item.SomeProperty < 24) .Min(item => item.PerformComputation()); However, if SomeOperation() is not thread safe, we could just as easily do: double min = collection .Select(item => item.SomeOperation()) .AsParallel() .Where(item => item.SomeProperty > 6 && item.SomeProperty < 24) .Min(item => item.PerformComputation()); In this case, we’re using standard LINQ to Objects for the Select(…) method, then converting the results of that map routine to a ParallelQuery<T>, and processing our filter (the Where method) and our aggregation (the Min method) in parallel. PLINQ also provides us with a way to convert a ParallelQuery<T> back into a standard IEnumerable<T>, forcing sequential processing via standard LINQ to Objects. If SomeOperation() was thread-safe, but PerformComputation() was not thread-safe, we would need to handle this by using the AsEnumerable() method: double min = collection .AsParallel() .Select(item => item.SomeOperation()) .Where(item => item.SomeProperty > 6 && item.SomeProperty < 24) .AsEnumerable() .Min(item => item.PerformComputation()); Here, we’re converting our collection into a ParallelQuery<T>, doing our map operation (the Select(…) method) and our filtering in parallel, then converting the collection back into a standard IEnumerable<T>, which causes our aggregation via Min() to be performed sequentially. This could also be written as two statements, as well, which would allow us to use the language integrated syntax for the first portion: var tempCollection = from item in collection.AsParallel() let e = item.SomeOperation() where (e.SomeProperty > 6 && e.SomeProperty < 24) select e; double min = tempCollection.AsEnumerable().Min(item => item.PerformComputation()); This allows us to use the standard LINQ style language integrated query syntax, but control whether it’s performed in parallel or serial by adding AsParallel() and AsEnumerable() appropriately. The second important difference between PLINQ and LINQ deals with order preservation. PLINQ, by default, does not preserve the order of of source collection. This is by design. In order to process a collection in parallel, the system needs to naturally deal with multiple elements at the same time. Maintaining the original ordering of the sequence adds overhead, which is, in many cases, unnecessary. Therefore, by default, the system is allowed to completely change the order of your sequence during processing. If you are doing a standard query operation, this is usually not an issue. However, there are times when keeping a specific ordering in place is important. If this is required, you can explicitly request the ordering be preserved throughout all operations done on a ParallelQuery<T> by using the AsOrdered() extension method. This will cause our sequence ordering to be preserved. For example, suppose we wanted to take a collection, perform an expensive operation which converts it to a new type, and display the first 100 elements. In LINQ to Objects, our code might look something like: // Using IEnumerable<SourceClass> collection IEnumerable<ResultClass> results = collection .Select(e => e.CreateResult()) .Take(100); If we just converted this to a parallel query naively, like so: IEnumerable<ResultClass> results = collection .AsParallel() .Select(e => e.CreateResult()) .Take(100); We could very easily get a very different, and non-reproducable, set of results, since the ordering of elements in the input collection is not preserved. To get the same results as our original query, we need to use: IEnumerable<ResultClass> results = collection .AsParallel() .AsOrdered() .Select(e => e.CreateResult()) .Take(100); This requests that PLINQ process our sequence in a way that verifies that our resulting collection is ordered as if it were processed serially. This will cause our query to run slower, since there is overhead involved in maintaining the ordering. However, in this case, it is required, since the ordering is required for correctness. PLINQ is incredibly useful. It allows us to easily take nearly any LINQ to Objects query and run it in parallel, using the same methods and syntax we’ve used previously. There are some important differences in operation that must be considered, however – it is not a free pass to parallelize everything. When using PLINQ in order to parallelize your routines declaratively, the same guideline I mentioned before still applies: Parallelization is something that should be handled with care and forethought, added by design, and not just introduced casually.

Read the article
Parallelism in .NET – Part 4, Imperative Data Parallelism: Aggregation

- by Reed

In the article on simple data parallelism, I described how to perform an operation on an entire collection of elements in parallel. Often, this is not adequate, as the parallel operation is going to be performing some form of aggregation. Simple examples of this might include taking the sum of the results of processing a function on each element in the collection, or finding the minimum of the collection given some criteria. This can be done using the techniques described in simple data parallelism, however, special care needs to be taken into account to synchronize the shared data appropriately. The Task Parallel Library has tools to assist in this synchronization. The main issue with aggregation when parallelizing a routine is that you need to handle synchronization of data. Since multiple threads will need to write to a shared portion of data. Suppose, for example, that we wanted to parallelize a simple loop that looked for the minimum value within a dataset: double min = double.MaxValue; foreach(var item in collection) { double value = item.PerformComputation(); min = System.Math.Min(min, value); } .csharpcode, .csharpcode pre { font-size: small; color: black; font-family: consolas, "Courier New", courier, monospace; background-color: #ffffff; /*white-space: pre;*/ } .csharpcode pre { margin: 0em; } .csharpcode .rem { color: #008000; } .csharpcode .kwrd { color: #0000ff; } .csharpcode .str { color: #006080; } .csharpcode .op { color: #0000c0; } .csharpcode .preproc { color: #cc6633; } .csharpcode .asp { background-color: #ffff00; } .csharpcode .html { color: #800000; } .csharpcode .attr { color: #ff0000; } .csharpcode .alt { background-color: #f4f4f4; width: 100%; margin: 0em; } .csharpcode .lnum { color: #606060; } This seems like a good candidate for parallelization, but there is a problem here. If we just wrap this into a call to Parallel.ForEach, we’ll introduce a critical race condition, and get the wrong answer. Let’s look at what happens here: // Buggy code! Do not use! double min = double.MaxValue; Parallel.ForEach(collection, item => { double value = item.PerformComputation(); min = System.Math.Min(min, value); }); This code has a fatal flaw: min will be checked, then set, by multiple threads simultaneously. Two threads may perform the check at the same time, and set the wrong value for min. Say we get a value of 1 in thread 1, and a value of 2 in thread 2, and these two elements are the first two to run. If both hit the min check line at the same time, both will determine that min should change, to 1 and 2 respectively. If element 1 happens to set the variable first, then element 2 sets the min variable, we’ll detect a min value of 2 instead of 1. This can lead to wrong answers. Unfortunately, fixing this, with the Parallel.ForEach call we’re using, would require adding locking. We would need to rewrite this like: // Safe, but slow double min = double.MaxValue; // Make a "lock" object object syncObject = new object(); Parallel.ForEach(collection, item => { double value = item.PerformComputation(); lock(syncObject) min = System.Math.Min(min, value); }); This will potentially add a huge amount of overhead to our calculation. Since we can potentially block while waiting on the lock for every single iteration, we will most likely slow this down to where it is actually quite a bit slower than our serial implementation. The problem is the lock statement – any time you use lock(object), you’re almost assuring reduced performance in a parallel situation. This leads to two observations I’ll make: When parallelizing a routine, try to avoid locks. That being said: Always add any and all required synchronization to avoid race conditions. These two observations tend to be opposing forces – we often need to synchronize our algorithms, but we also want to avoid the synchronization when possible. Looking at our routine, there is no way to directly avoid this lock, since each element is potentially being run on a separate thread, and this lock is necessary in order for our routine to function correctly every time. However, this isn’t the only way to design this routine to implement this algorithm. Realize that, although our collection may have thousands or even millions of elements, we have a limited number of Processing Elements (PE). Processing Element is the standard term for a hardware element which can process and execute instructions. This typically is a core in your processor, but many modern systems have multiple hardware execution threads per core. The Task Parallel Library will not execute the work for each item in the collection as a separate work item. Instead, when Parallel.ForEach executes, it will partition the collection into larger “chunks” which get processed on different threads via the ThreadPool. This helps reduce the threading overhead, and help the overall speed. In general, the Parallel class will only use one thread per PE in the system. Given the fact that there are typically fewer threads than work items, we can rethink our algorithm design. We can parallelize our algorithm more effectively by approaching it differently. Because the basic aggregation we are doing here (Min) is communitive, we do not need to perform this in a given order. We knew this to be true already – otherwise, we wouldn’t have been able to parallelize this routine in the first place. With this in mind, we can treat each thread’s work independently, allowing each thread to serially process many elements with no locking, then, after all the threads are complete, “merge” together the results. This can be accomplished via a different set of overloads in the Parallel class: Parallel.ForEach<TSource,TLocal>. The idea behind these overloads is to allow each thread to begin by initializing some local state (TLocal). The thread will then process an entire set of items in the source collection, providing that state to the delegate which processes an individual item. Finally, at the end, a separate delegate is run which allows you to handle merging that local state into your final results. To rewriting our routine using Parallel.ForEach<TSource,TLocal>, we need to provide three delegates instead of one. The most basic version of this function is declared as: public static ParallelLoopResult ForEach<TSource, TLocal>( IEnumerable<TSource> source, Func<TLocal> localInit, Func<TSource, ParallelLoopState, TLocal, TLocal> body, Action<TLocal> localFinally ) The first delegate (the localInit argument) is defined as Func<TLocal>. This delegate initializes our local state. It should return some object we can use to track the results of a single thread’s operations. The second delegate (the body argument) is where our main processing occurs, although now, instead of being an Action<T>, we actually provide a Func<TSource, ParallelLoopState, TLocal, TLocal> delegate. This delegate will receive three arguments: our original element from the collection (TSource), a ParallelLoopState which we can use for early termination, and the instance of our local state we created (TLocal). It should do whatever processing you wish to occur per element, then return the value of the local state after processing is completed. The third delegate (the localFinally argument) is defined as Action<TLocal>. This delegate is passed our local state after it’s been processed by all of the elements this thread will handle. This is where you can merge your final results together. This may require synchronization, but now, instead of synchronizing once per element (potentially millions of times), you’ll only have to synchronize once per thread, which is an ideal situation. Now that I’ve explained how this works, lets look at the code: // Safe, and fast! double min = double.MaxValue; // Make a "lock" object object syncObject = new object(); Parallel.ForEach( collection, // First, we provide a local state initialization delegate. () => double.MaxValue, // Next, we supply the body, which takes the original item, loop state, // and local state, and returns a new local state (item, loopState, localState) => { double value = item.PerformComputation(); return System.Math.Min(localState, value); }, // Finally, we provide an Action<TLocal>, to "merge" results together localState => { // This requires locking, but it's only once per used thread lock(syncObj) min = System.Math.Min(min, localState); } ); Although this is a bit more complicated than the previous version, it is now both thread-safe, and has minimal locking. This same approach can be used by Parallel.For, although now, it’s Parallel.For<TLocal>. When working with Parallel.For<TLocal>, you use the same triplet of delegates, with the same purpose and results. Also, many times, you can completely avoid locking by using a method of the Interlocked class to perform the final aggregation in an atomic operation. The MSDN example demonstrating this same technique using Parallel.For uses the Interlocked class instead of a lock, since they are doing a sum operation on a long variable, which is possible via Interlocked.Add. By taking advantage of local state, we can use the Parallel class methods to parallelize algorithms such as aggregation, which, at first, may seem like poor candidates for parallelization. Doing so requires careful consideration, and often requires a slight redesign of the algorithm, but the performance gains can be significant if handled in a way to avoid excessive synchronization.

Read the article
Parallelism in .NET – Part 11, Divide and Conquer via Parallel.Invoke

- by Reed

Many algorithms are easily written to work via recursion. For example, most data-oriented tasks where a tree of data must be processed are much more easily handled by starting at the root, and recursively “walking” the tree. Some algorithms work this way on flat data structures, such as arrays, as well. This is a form of divide and conquer: an algorithm design which is based around breaking up a set of work recursively, “dividing” the total work in each recursive step, and “conquering” the work when the remaining work is small enough to be solved easily. Recursive algorithms, especially ones based on a form of divide and conquer, are often a very good candidate for parallelization. This is apparent from a common sense standpoint. Since we’re dividing up the total work in the algorithm, we have an obvious, built-in partitioning scheme. Once partitioned, the data can be worked upon independently, so there is good, clean isolation of data. Implementing this type of algorithm is fairly simple. The Parallel class in .NET 4 includes a method suited for this type of operation: Parallel.Invoke. This method works by taking any number of delegates defined as an Action, and operating them all in parallel. The method returns when every delegate has completed: Parallel.Invoke( () => { Console.WriteLine("Action 1 executing in thread {0}", Thread.CurrentThread.ManagedThreadId); }, () => { Console.WriteLine("Action 2 executing in thread {0}", Thread.CurrentThread.ManagedThreadId); }, () => { Console.WriteLine("Action 3 executing in thread {0}", Thread.CurrentThread.ManagedThreadId); } ); .csharpcode, .csharpcode pre { font-size: small; color: black; font-family: consolas, "Courier New", courier, monospace; background-color: #ffffff; /*white-space: pre;*/ } .csharpcode pre { margin: 0em; } .csharpcode .rem { color: #008000; } .csharpcode .kwrd { color: #0000ff; } .csharpcode .str { color: #006080; } .csharpcode .op { color: #0000c0; } .csharpcode .preproc { color: #cc6633; } .csharpcode .asp { background-color: #ffff00; } .csharpcode .html { color: #800000; } .csharpcode .attr { color: #ff0000; } .csharpcode .alt { background-color: #f4f4f4; width: 100%; margin: 0em; } .csharpcode .lnum { color: #606060; } Running this simple example demonstrates the ease of using this method. For example, on my system, I get three separate thread IDs when running the above code. By allowing any number of delegates to be executed directly, concurrently, the Parallel.Invoke method provides us an easy way to parallelize any algorithm based on divide and conquer. We can divide our work in each step, and execute each task in parallel, recursively. For example, suppose we wanted to implement our own quicksort routine. The quicksort algorithm can be designed based on divide and conquer. In each iteration, we pick a pivot point, and use that to partition the total array. We swap the elements around the pivot, then recursively sort the lists on each side of the pivot. For example, let’s look at this simple, sequential implementation of quicksort: public static void QuickSort<T>(T[] array) where T : IComparable<T> { QuickSortInternal(array, 0, array.Length - 1); } private static void QuickSortInternal<T>(T[] array, int left, int right) where T : IComparable<T> { if (left >= right) { return; } SwapElements(array, left, (left + right) / 2); int last = left; for (int current = left + 1; current <= right; ++current) { if (array[current].CompareTo(array[left]) < 0) { ++last; SwapElements(array, last, current); } } SwapElements(array, left, last); QuickSortInternal(array, left, last - 1); QuickSortInternal(array, last + 1, right); } static void SwapElements<T>(T[] array, int i, int j) { T temp = array[i]; array[i] = array[j]; array[j] = temp; } Here, we implement the quicksort algorithm in a very common, divide and conquer approach. Running this against the built-in Array.Sort routine shows that we get the exact same answers (although the framework’s sort routine is slightly faster). On my system, for example, I can use framework’s sort to sort ten million random doubles in about 7.3s, and this implementation takes about 9.3s on average. Looking at this routine, though, there is a clear opportunity to parallelize. At the end of QuickSortInternal, we recursively call into QuickSortInternal with each partition of the array after the pivot is chosen. This can be rewritten to use Parallel.Invoke by simply changing it to: // Code above is unchanged... SwapElements(array, left, last); Parallel.Invoke( () => QuickSortInternal(array, left, last - 1), () => QuickSortInternal(array, last + 1, right) ); } This routine will now run in parallel. When executing, we now see the CPU usage across all cores spike while it executes. However, there is a significant problem here – by parallelizing this routine, we took it from an execution time of 9.3s to an execution time of approximately 14 seconds! We’re using more resources as seen in the CPU usage, but the overall result is a dramatic slowdown in overall processing time. This occurs because parallelization adds overhead. Each time we split this array, we spawn two new tasks to parallelize this algorithm! This is far, far too many tasks for our cores to operate upon at a single time. In effect, we’re “over-parallelizing” this routine. This is a common problem when working with divide and conquer algorithms, and leads to an important observation: When parallelizing a recursive routine, take special care not to add more tasks than necessary to fully utilize your system. This can be done with a few different approaches, in this case. Typically, the way to handle this is to stop parallelizing the routine at a certain point, and revert back to the serial approach. Since the first few recursions will all still be parallelized, our “deeper” recursive tasks will be running in parallel, and can take full advantage of the machine. This also dramatically reduces the overhead added by parallelizing, since we’re only adding overhead for the first few recursive calls. There are two basic approaches we can take here. The first approach would be to look at the total work size, and if it’s smaller than a specific threshold, revert to our serial implementation. In this case, we could just check right-left, and if it’s under a threshold, call the methods directly instead of using Parallel.Invoke. The second approach is to track how “deep” in the “tree” we are currently at, and if we are below some number of levels, stop parallelizing. This approach is a more general-purpose approach, since it works on routines which parse trees as well as routines working off of a single array, but may not work as well if a poor partitioning strategy is chosen or the tree is not balanced evenly. This can be written very easily. If we pass a maxDepth parameter into our internal routine, we can restrict the amount of times we parallelize by changing the recursive call to: // Code above is unchanged... SwapElements(array, left, last); if (maxDepth < 1) { QuickSortInternal(array, left, last - 1, maxDepth); QuickSortInternal(array, last + 1, right, maxDepth); } else { --maxDepth; Parallel.Invoke( () => QuickSortInternal(array, left, last - 1, maxDepth), () => QuickSortInternal(array, last + 1, right, maxDepth)); } We no longer allow this to parallelize indefinitely – only to a specific depth, at which time we revert to a serial implementation. By starting the routine with a maxDepth equal to Environment.ProcessorCount, we can restrict the total amount of parallel operations significantly, but still provide adequate work for each processing core. With this final change, my timings are much better. On average, I get the following timings: Framework via Array.Sort: 7.3 seconds Serial Quicksort Implementation: 9.3 seconds Naive Parallel Implementation: 14 seconds Parallel Implementation Restricting Depth: 4.7 seconds Finally, we are now faster than the framework’s Array.Sort implementation.

Read the article
Using Clojure instead of Python for scalability (multi core) reasons, good idea?

- by Vandell

After reading http://clojure.org/rationale and other performance comparisons between Clojure and many languages, I started to think that apart from ease of use, I shouldn't be coding in Python anymore, but in Clojure instead. Actually, I began to fill irresponsisble for not learning clojure seeing it's benefits. Does it make sense? Can't I make really efficient use of all cores using a more imperative language like Python, than a lisp dialect or other functional language? It seems that all the benefits of it come from using immutable data, can't I do just that in Python and have all the benefits? I once started to learn some Common Lisp, read and done almost all exercices from a book I borrowod from my university library (I found it to be pretty good, despite it's low popularity on Amazon). But, after a while, I got myself struggling to much to do some simple things. I think there's somethings that are more imperative in their nature, that makes it difficult to model those thins in a functional way, I guess. The thing is, is Python as powerful as Clojure for building applications that takes advantages of this new multi core future? Note that I don't think that using semaphores, lock mechanisms or other similar concurrency mechanism are good alternatives to Clojure 'automatic' parallelization.

Read the article
deciding between subprocess, multiprocesser and thread in Python?

- by user248237

I'd like to parallelize my Python program so that it can make use of multiple processors on the machine that it runs on. My parallelization is very simple, in that all the parallel "threads" of the program are independent and write their output to separate files. I don't need the threads to exchange information but it is imperative that I know when the threads finish since some steps of my pipeline depend on their output. Portability is important, in that I'd like this to run on any Python version on Mac, Linux and Windows. Given these constraints, which is the most appropriate Python module for implementing this? I am tryign to decide between thread, subprocess and multiprocessing, which all seem to provide related functionality. Any thoughts on this? I'd like the simplest solution that's portable. Thanks.

Read the article
Parallelism in Python

- by fmark

What are the options for achieving parallelism in Python? I want to perform a bunch of CPU bound calculations over some very large rasters, and would like to parallelise them. Coming from a C background, I am familiar with three approaches to parallelism: Message passing processes, possibly distributed across a cluster, e.g. MPI. Explicit shared memory parallelism, either using pthreads or fork(), pipe(), et. al Implicit shared memory parallelism, using OpenMP. Deciding on an approach to use is an exercise in trade-offs. In Python, what approaches are available and what are their characteristics? Is there a clusterable MPI clone? What are the preferred ways of achieving shared memory parallelism? I have heard reference to problems with the GIL, as well as references to tasklets. In short, what do I need to know about the different parallelization strategies in Python before choosing between them?

Read the article
Library for Dataflow in C

- by msutherl

How can I do dataflow (pipes and filters, stream processing, flow based) in C? And not with UNIX pipes. I recently came across stream.py. Streams are iterables with a pipelining mechanism to enable data-flow programming and easy parallelization. The idea is to take the output of a function that turns an iterable into another iterable and plug that as the input of another such function. While you can already do this using function composition, this package provides an elegant notation for it by overloading the operator. I would like to duplicate a simple version of this kind of functionality in C. I particularly like the overloading of the operator to avoid function composition mess. Wikipedia points to this hint from a Usenet post in 1990. Why C? Because I would like to be able to do this on microcontrollers and in C extensions for other high level languages (Max, Pd*, Python). * (ironic given that Max and Pd were written, in C, specifically for this purpose – I'm looking for something barebones)

Read the article
Oracle Releases New Mainframe Re-Hosting in Oracle Tuxedo 11g

- by Jason Williamson

I'm excited to say that we've released our next generation of Re-hosting in 11g. In fact I'm doing some hands-on labs now for our Systems Integrators in Italy in a couple of weeks and targeting Latin America next month. If you are an SI, or Rehosting firm and are looking to become an Oracle Partner or get a better understanding of Tuxedo and how to use the workbench for rehosting...drop me a line. Oracle Tuxedo Application Runtime for CICS and Batch 11g provides a CICS API emulation and Batch environment that exploits the full range of Oracle Tuxedo's capabilities. Re-hosted applications run in a multi-node, grid environment with centralized production control. Also, enterprise integration of CICS application services benefits from an open and SOA-enabled framework. Key features include: CICS Application Runtime: Can run IBM CICS applications unchanged in an application grid, which enables the distribution of large workloads across multiple processors and nodes. This simplifies CICS administration and can scale to over 100,000 users and over 50,000 transactions per second. 3270 Terminal Server: Protects business users from change through support for tn3270 terminal emulation. Distributed CICS Resource Management: Simplifies deployment and administration by allowing customers to run CICS regions in a distributed configuration. Batch Application Runtime: Provides robust IBM JES-like job management that enables local or remote job submissions. In addition, distributed batch initiators can enable parallelization of jobs and support fail-over, shortening the batch window and helping to meet stringent SLAs. Batch Execution Environment: Helps to run IBM batch unchanged and also supports JCL functionality and all common batch utilities. Oracle Tuxedo Application Rehosting Workbench 11g provides a set of automated migration tools integrated around a central repository. The tools provide high precision which results in very low error rates and the ability to handle large applications. This enables less expensive, low-risk migration projects. Key capabilities include: Workbench Repository and Cataloguer: Ensures integrity of the migrated application assets through full dependency checking. The Cataloguer generates and maintains all relevant meta-data on source and target components. File Migrator: Supports reliable migration of datasets and flat files to an ISAM or Oracle Database 11g. This is done through the automated migration utilities for data unloading, reloading and validation. It also generates logical access functions to shield developers from data repository changes. DB2 Migrator: Similarly, this tool automates the migration of DB2 schema and data to Oracle Database 11g. COBOL Migrator: Supports migration of IBM mainframe COBOL assets (OLTP and Batch) to open systems. Adapts programs for compiler dialects and data access variations. JCL Migrator: Supports migration of IBM JCL jobs to a Tuxedo ART environment, maintaining the flow and characteristics of batch jobs.

Read the article
C++ Accelerated Massive Parallelism

- by Daniel Moth

At AMD's Fusion conference Herb Sutter announced in his keynote session a technology that our team has been working on that we call C++ Accelerated Massive Parallelism (C++ AMP) and during the keynote I showed a brief demo of an app built with our technology. After the keynote, I go deeper into the technology in my breakout session. If you read both those abstracts, you'll get some information about what C++ AMP is, without being too explicit since we published the abstracts before the technology was announced. You can find the official online announcement at Soma's blog post. Here, I just wanted to capture the key points about C++ AMP that can serve as an introduction and an FAQ. So, in no particular order… C++ AMP lowers the barrier to entry for heterogeneous hardware programmability and brings performance to the mainstream, without sacrificing developer productivity or solution portability. is designed not only to help you address today's massively parallel hardware (i.e. GPUs and APUs), but it also future proofs your code investments with a forward looking design. is part of Visual C++. You don't need to use a different compiler or learn different syntax. is modern C++. Not C or some other derivative. is integrated and supported fully in Visual Studio vNext. Editing, building, debugging, profiling and all the other goodness of Visual Studio work well with C++ AMP. provides an STL-like library as part of the existing concurrency namespace and delivered in the new amp.h header file. makes it extremely easy to work with large multi-dimensional data on heterogeneous hardware; in a manner that exposes parallelization. introduces only one core C++ language extension. builds on DirectX (and DirectCompute in particular) which offers a great hardware abstraction layer that is ubiquitous and reliable. The architecture is such, that this point can be thought of as an implementation detail that does not surface to the API layer. Stay tuned on my blog for more over the coming months where I will switch from just talking about C++ AMP to showing you how to use the API with code examples… Comments about this post welcome at the original blog.

Read the article

Search Results

Search found 55 results on 3 pages for 'parallelization'.

Page 1/3 | 1 2 3 | Next Page >

- by ehsanul

- by veda

- by dahpgjgamgan

- by reve_etrange

- by Hristo

- by JuergenKress

- by thatjeffsmith

- by McMuttons

- by Benubird

- by Werner

- by Peter

- by stan31337

- by ajay

- by Reed

- by Reed

- by Reed

- by Vandell

- by user248237

- by fmark

- by msutherl

- by Jason Williamson

- by Daniel Moth

1 2 3 | Next Page >