Search Results

Search found 1774 results on 71 pages for 'parallel for'.

Page 18/71 | < Previous Page | 14 15 16 17 18 19 20 21 22 23 24 25 | Next Page >

Programming for Multi core Processors

- by Chathuranga Chandrasekara

As far as I know, the multi-core architecture in a processor does not effect the program. The actual instruction execution is handled in a lower layer. my question is, Given that you have a multicore environment, Can I use any programming practices to utilize the available resources more effectively? How should I change my code to gain more performance in multicore environments?

Read the article
local variable 'sresult' referenced before assignment

- by user288558

I have had multiple problems trying to use PP. I am running python2.6 and pp 1.6.0 rc3. Using the following test code: import pp nodes=('mosura02','mosura03','mosura04','mosura05','mosura06', 'mosura09','mosura10','mosura11','mosura12') def pptester(): js=pp.Server(ppservers=nodes) tmp=[] for i in range(200): tmp.append(js.submit(ppworktest,(),(),('os',))) return tmp def ppworktest(): return os.system("uname -a") gives me the following result: In [10]: Exception in thread run_local: Traceback (most recent call last): File "/usr/lib64/python2.6/threading.py", line 525, in __bootstrap_inner self.run() File "/usr/lib64/python2.6/threading.py", line 477, in run self.__target(*self.__args, **self.__kwargs) File "/home/wkerzend/python_coala/lib/python2.6/site-packages/pp.py", line 751, in _run_local job.finalize(sresult) UnboundLocalError: local variable 'sresult' referenced before assignment Exception in thread run_local: Traceback (most recent call last): File "/usr/lib64/python2.6/threading.py", line 525, in __bootstrap_inner self.run() File "/usr/lib64/python2.6/threading.py", line 477, in run self.__target(*self.__args, **self.__kwargs) File "/home/wkerzend/python_coala/lib/python2.6/site-packages/pp.py", line 751, in _run_local job.finalize(sresult) UnboundLocalError: local variable 'sresult' referenced before assignment Exception in thread run_local: Traceback (most recent call last): File "/usr/lib64/python2.6/threading.py", line 525, in __bootstrap_inner self.run() File "/usr/lib64/python2.6/threading.py", line 477, in run self.__target(*self.__args, **self.__kwargs) File "/home/wkerzend/python_coala/lib/python2.6/site-packages/pp.py", line 751, in _run_local job.finalize(sresult) UnboundLocalError: local variable 'sresult' referenced before assignment Exception in thread run_local: Traceback (most recent call last): File "/usr/lib64/python2.6/threading.py", line 525, in __bootstrap_inner self.run() File "/usr/lib64/python2.6/threading.py", line 477, in run self.__target(*self.__args, **self.__kwargs) File "/home/wkerzend/python_coala/lib/python2.6/site-packages/pp.py", line 751, in _run_local job.finalize(sresult) UnboundLocalError: local variable 'sresult' referenced before assignment any help greatly appreciated

Read the article
Concurrent cartesian product algorithm in Clojure

- by jqno

Is there a good algorithm to calculate the cartesian product of three seqs concurrently in Clojure? I'm working on a small hobby project in Clojure, mainly as a means to learn the language, and its concurrency features. In my project, I need to calculate the cartesian product of three seqs (and do something with the results). I found the cartesian-product function in clojure.contrib.combinatorics, which works pretty well. However, the calculation of the cartesian product turns out to be the bottleneck of the program. Therefore, I'd like to perform the calculation concurrently. Now, for the map function, there's a convenient pmap alternative that magically makes the thing concurrent. Which is cool :). Unfortunately, such a thing doesn't exist for cartesian-product. I've looked at the source code, but I can't find an easy way to make it concurrent myself. Also, I've tried to implement an algorithm myself using map, but I guess my algorithmic skills aren't what they used to be. I managed to come up with something ugly for two seqs, but three was definitely a bridge too far. So, does anyone know of an algorithm that's already concurrent, or one that I can parallelize myself?

Read the article
Can I remove items from a ConcurrentDictionary from within an enumeration loop of that dictionary?

- by the-locster

So for example: ConcurrentDictionary<string,Payload> itemCache = GetItems(); foreach(KeyValuePair<string,Payload> kvPair in itemCache) { if(TestItemExpiry(kvPair.Value)) { // Remove expired item. Payload removedItem; itemCache.TryRemove(kvPair.Key, out removedItem); } } Obviously with an ordinary Dictionary this will throw an exception because removing items changes the dictionary's internal state during the life of the enumeration. It's my understanding that this is not the case for a ConcurrentDictionary as the provided IEnumerable handles internal state changing. Am I understanding this right? Is there a better pattern to use?

Read the article
F# performance in scientific computing

- by aaa

hello. I am curious as to how F# performance compares to C++ performance? I asked a similar question with regards to Java, and the impression I got was that Java is not suitable for heavy numbercrunching. I have read that F# is supposed to be more scalable and more performant, but how is this real-world performance compares to C++? specific questions about current implementation are: How well does it do floating-point? Does it allow vector instructions how friendly is it towards optimizing compilers? How big a memory foot print does it have? Does it allow fine-grained control over memory locality? does it have capacity for distributed memory processors, for example Cray? what features does it have that may be of interest to computational science where heavy number processing is involved? Are there actual scientific computing implementations that use it? Thanks

Read the article
Hadooop map reduce

- by Aina Ari

Im very much new to map reduce and i completed hadoop wordcount example. In that example it produces unsorted file (with key value) of word counts. So is it possible to make it sorted according to the most number of word occurrences by combining another map reduce task to the earlier one. Thanks in Advance

Read the article
How to setup matlabpool for multiple processors?

- by JohnIdol

I just setup a Extra Large Heavy Computation EC2 instance to throw it at my Genetic Algorithms problem, hoping to speed up things. This instance has 8 Intel Xeon processors (around 2.4Ghz each) and 7 Gigs of RAM. On my machine I have an Intel Core Duo, and matlab is able to work with my two cores just fine by runinng: matlabpool open 2 On the EC2 instance though, matlab only is capable of detecting 1 out of 8 processors, and if I try running: matlabpool open 8 I get an error saying that the ClusterSize is 1 since there's only 1 core on my CPU. True, there is only 1 core on each CPU, but I have 8 CPUs on the given EC2 instance! So the difference from my machine and the ec2 instance is that I have my 2 cores on a single processor locally, while the EC2 instance has 8 distinct processors. My question is, how do I get matlab to work with those 8 processors? I found this paper, but it seems related to setting up matlab with multiple EC2 instances (not related to multiple processors on the same instance, EC2 or not), which is not my problem. Any help appreciated!

Read the article
Return data from subroutine while the subroutine is still processing

- by Perl QuestionAsker

Is there any way to have a subroutine send data back while still processing? For instance (this example used simply to illustrate) - a subroutine reads a file. While it is reading through the file, if some condition is met, then "return" that line and keep processing. I know there are those that will answer - why would you want to do that? and why don't you just ...?, but I really would like to know if this is possible. Thank you so much in advance.

Read the article
Strategies to use Database Sequences?

- by Bruno Brant

Hello all, I have a high-end architecture which receives many requests every second (in fact, it can receive many requests every millisecond). The architecture is designed so that some controls rely on a certain unique id assigned to each request. To create such UID we use a DB2 Sequence. Right now I already understand that this approach is flawed, since using the database is costly, but it makes sense to do so because this value will also be used to log information on the database. My team has just found out an increase of almost 1000% in elapsed time for each transaction, which we are assuming happened because of the sequence. Now I wonder, using sequences will serialize access to my application? Since they have to guarantee that increments works the way they should, they have to, right? So, are there better strategies when using sequences? Please assume that I have no other way of obtaining a unique id other than relying on the database.

Read the article
CPU Affinity Masks (Putting Threads on different CPUs)

- by hahuang65

I have 4 threads, and I am trying to set thread 1 to run on CPU 1, thread 2 on CPU 2, etc. However, when I run my code below, the affinity masks are returning the correct values, but when I do a sched_getcpu() on the threads, they all return that they are running on CPU 4. Anybody know what my problem here is? Thanks in advance! #define _GNU_SOURCE #include <stdio.h> #include <pthread.h> #include <stdlib.h> #include <sched.h> #include <errno.h> void *pthread_Message(char *message) { printf("%s is running on CPU %d\n", message, sched_getcpu()); } int main() { pthread_t thread1, thread2, thread3, thread4; pthread_t threadArray[4]; cpu_set_t cpu1, cpu2, cpu3, cpu4; char *thread1Msg = "Thread 1"; char *thread2Msg = "Thread 2"; char *thread3Msg = "Thread 3"; char *thread4Msg = "Thread 4"; int thread1Create, thread2Create, thread3Create, thread4Create, i, temp; CPU_ZERO(&cpu1); CPU_SET(1, &cpu1); temp = pthread_setaffinity_np(thread1, sizeof(cpu_set_t), &cpu1); printf("Set returned by pthread_getaffinity_np() contained:\n"); for (i = 0; i < CPU_SETSIZE; i++) if (CPU_ISSET(i, &cpu1)) printf("CPU1: CPU %d\n", i); CPU_ZERO(&cpu2); CPU_SET(2, &cpu2); temp = pthread_setaffinity_np(thread2, sizeof(cpu_set_t), &cpu2); for (i = 0; i < CPU_SETSIZE; i++) if (CPU_ISSET(i, &cpu2)) printf("CPU2: CPU %d\n", i); CPU_ZERO(&cpu3); CPU_SET(3, &cpu3); temp = pthread_setaffinity_np(thread3, sizeof(cpu_set_t), &cpu3); for (i = 0; i < CPU_SETSIZE; i++) if (CPU_ISSET(i, &cpu3)) printf("CPU3: CPU %d\n", i); CPU_ZERO(&cpu4); CPU_SET(4, &cpu4); temp = pthread_setaffinity_np(thread4, sizeof(cpu_set_t), &cpu4); for (i = 0; i < CPU_SETSIZE; i++) if (CPU_ISSET(i, &cpu4)) printf("CPU4: CPU %d\n", i); thread1Create = pthread_create(&thread1, NULL, (void *)pthread_Message, thread1Msg); thread2Create = pthread_create(&thread2, NULL, (void *)pthread_Message, thread2Msg); thread3Create = pthread_create(&thread3, NULL, (void *)pthread_Message, thread3Msg); thread4Create = pthread_create(&thread4, NULL, (void *)pthread_Message, thread4Msg); pthread_join(thread1, NULL); pthread_join(thread2, NULL); pthread_join(thread3, NULL); pthread_join(thread4, NULL); return 0; }

Read the article
Submitting R jobs using PBS

- by Tony

I am submitting a job using qsub that runs parallelized R. My intention is to have R programme running on 4 different cores rather than 8 cores. Here are some of my settings in PBS file: #PBS -l nodes=1:ppn=4 .... time R --no-save < program1.R > program1.log I am issuing the command "ta job_id" and I'm seeing that 4 cores are listed. However, the job occupies a large amount of memory(31944900k used vs 32949628k total). If I were to use 8 cores, the jobs got hang due to memory limitation. top - 21:03:53 up 77 days, 11:54, 0 users, load average: 3.99, 3.75, 3.37 Tasks: 207 total, 5 running, 202 sleeping, 0 stopped, 0 zombie Cpu(s): 30.4%us, 1.6%sy, 0.0%ni, 66.8%id, 0.0%wa, 0.0%hi, 1.2%si, 0.0%st Mem: 32949628k total, 31944900k used, 1004728k free, 269812k buffers Swap: 2097136k total, 8360k used, 2088776k free, 6030856k cached Here is a snapshot when issuing command ta job_id PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1794 x 25 0 6247m 6.0g 1780 R 99.2 19.1 8:14.37 R 1795 x 25 0 6332m 6.1g 1780 R 99.2 19.4 8:14.37 R 1796 x 25 0 6242m 6.0g 1784 R 99.2 19.1 8:14.37 R 1797 x 25 0 6322m 6.1g 1780 R 99.2 19.4 8:14.33 R 1714 x 18 0 65932 1504 1248 S 0.0 0.0 0:00.00 bash 1761 x 18 0 63840 1244 1052 S 0.0 0.0 0:00.00 20016.hpc 1783 x 18 0 133m 7096 1128 S 0.0 0.0 0:00.00 python 1786 x 18 0 137m 46m 2688 S 0.0 0.1 0:02.06 R How can I prevent other users to use the other 4 cores? I like to mask somehow that my job is using 8 cores with 4 cores idling. Could anyone kindly help me out on this? Can this be solved using pbs? Many Thanks

Read the article
chaining array of tasks with continuation

- by Andrei Cristof

I have a Task structure that is a little bit complex(for me at least). The structure is: (where T = Task) T1, T2, T3... Tn. There's an array (a list of files), and the T's represent tasks created for each file. Each T has always two subtasks that it must complete or fail: Tn.1, Tn.2 - download and install. For each download (Tn.1) there are always two subtasks to try, download from two paths(Tn.1.1, Tn.1.2). Execution would be: First, download file: Tn1.1. If Tn.1.1 fails, then Tn.1.2 executes. If either from download tasks returns OK - execute Tn.2. If Tn.2 executed or failed - go to next Tn. I figured the first thing to do, was to write all this structure with jagged arrays: private void CreateTasks() { //main array Task<int>[][][] mainTask = new Task<int>[_queuedApps.Count][][]; for (int i = 0; i < mainTask.Length; i++) { Task<int>[][] arr = GenerateOperationTasks(); mainTask[i] = arr; } } private Task<int>[][] GenerateOperationTasks() { //two download tasks Task<int>[] downloadTasks = new Task<int>[2]; downloadTasks[0] = new Task<int>(() => { return 0; }); downloadTasks[1] = new Task<int>(() => { return 0; }); //one installation task Task<int>[] installTask = new Task<int>[1] { new Task<int>(() => { return 0; }) }; //operations Task is jagged - keeps tasks above Task<int>[][] operationTasks = new Task<int>[2][]; operationTasks[0] = downloadTasks; operationTasks[1] = installTask; return operationTasks; } So now I got my mainTask array of tasks, containing nicely ordered tasks just as described above. However after reading the docs on ContinuationTasks, I realise this does not help me since I must call e.g. Task.ContinueWith(Task2). I'm stumped about doing this on my mainTask array. I can't write mainTask[0].ContinueWith(mainTask[1]) because I dont know the size of the array. If I could somehow reference the next task in the array (but without knowing its index), but cant figure out how. Any ideas? Thank you very much for your help. Regards,

Read the article
MPIexec.exe Access denide

- by shake

I have installed microsoft compute cluster and MPI.net, now i have trouble to run program using mpiexec.exe on cluster. When i try to run it on console i get message: "Access Denied", and pop up: "mpiexec.exe is not valid win32 application". I tried google it, but found nothing. Pls help. :)

Read the article
.net 4.0 concurrent queue dictionary

- by freddy smith

I would like to use the new concurrent collections in .NET 4.0 to solve the following problem. The basic data structure I want to have is a producer consumer queue, there will be a single consumer and multiple producers. There are items of type A,B,C,D,E that will be added to this queue. Items of type A,B,C are added to the queue in the normal manner and processed in order. However items of type D or E can only exist in the queue zero or once. If one of these is to be added and there already exists another of the same type that has not yet been processed then this should update that other one in-place in the queue. The queue position would not change (i.e. would not go to the back of the queue) after the update. Which .NET 4.0 classes would be best for this?

Read the article
I don't understand how work call_once

- by SABROG

Please help me understand how work call_once Here is thread-safe code. I don't understand why this need Thread Local Storage and global_epoch variables. Variable _fast_pthread_once_per_thread_epoch can be changed to constant/enum like {FAST_PTHREAD_ONCE_INIT, BEING_INITIALIZED, FINISH_INITIALIZED}. Why needed count calls in global_epoch? I think this code can be rewriting with logc: if flag FINISH_INITIALIZED do nothing, else go to block with mutexes and this all. #ifndef FAST_PTHREAD_ONCE_H #define FAST_PTHREAD_ONCE_H #include #include typedef sig_atomic_t fast_pthread_once_t; #define FAST_PTHREAD_ONCE_INIT SIG_ATOMIC_MAX extern __thread fast_pthread_once_t _fast_pthread_once_per_thread_epoch; #ifdef __cplusplus extern "C" { #endif extern void fast_pthread_once( pthread_once_t *once, void (*func)(void) ); inline static void fast_pthread_once_inline( fast_pthread_once_t *once, void (*func)(void) ) { fast_pthread_once_t x = *once; /* unprotected access */ if ( x _fast_pthread_once_per_thread_epoch ) { fast_pthread_once( once, func ); } } #ifdef __cplusplus } #endif #endif FAST_PTHREAD_ONCE_H Source fast_pthread_once.c The source is written in C. The lines of the primary function are numbered for reference in the subsequent correctness argument. #include "fast_pthread_once.h" #include static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER; /* protects global_epoch and all fast_pthread_once_t writes */ static pthread_cond_t cv = PTHREAD_COND_INITIALIZER; /* signalled whenever a fast_pthread_once_t is finalized */ #define BEING_INITIALIZED (FAST_PTHREAD_ONCE_INIT - 1) static fast_pthread_once_t global_epoch = 0; /* under mu */ __thread fast_pthread_once_t _fast_pthread_once_per_thread_epoch; static void check( int x ) { if ( x == 0 ) abort(); } void fast_pthread_once( fast_pthread_once_t *once, void (*func)(void) ) { /*01*/ fast_pthread_once_t x = *once; /* unprotected access */ /*02*/ if ( x _fast_pthread_once_per_thread_epoch ) { /*03*/ check( pthread_mutex_lock(µ) == 0 ); /*04*/ if ( *once == FAST_PTHREAD_ONCE_INIT ) { /*05*/ *once = BEING_INITIALIZED; /*06*/ check( pthread_mutex_unlock(µ) == 0 ); /*07*/ (*func)(); /*08*/ check( pthread_mutex_lock(µ) == 0 ); /*09*/ global_epoch++; /*10*/ *once = global_epoch; /*11*/ check( pthread_cond_broadcast(&cv;) == 0 ); /*12*/ } else { /*13*/ while ( *once == BEING_INITIALIZED ) { /*14*/ check( pthread_cond_wait(&cv;, µ) == 0 ); /*15*/ } /*16*/ } /*17*/ _fast_pthread_once_per_thread_epoch = global_epoch; /*18*/ check (pthread_mutex_unlock(µ) == 0); } } This code from BOOST: #ifndef BOOST_THREAD_PTHREAD_ONCE_HPP #define BOOST_THREAD_PTHREAD_ONCE_HPP // once.hpp // // (C) Copyright 2007-8 Anthony Williams // // Distributed under the Boost Software License, Version 1.0. (See // accompanying file LICENSE_1_0.txt or copy at // http://www.boost.org/LICENSE_1_0.txt) #include #include #include #include "pthread_mutex_scoped_lock.hpp" #include #include #include namespace boost { struct once_flag { boost::uintmax_t epoch; }; namespace detail { BOOST_THREAD_DECL boost::uintmax_t& get_once_per_thread_epoch(); BOOST_THREAD_DECL extern boost::uintmax_t once_global_epoch; BOOST_THREAD_DECL extern pthread_mutex_t once_epoch_mutex; BOOST_THREAD_DECL extern pthread_cond_t once_epoch_cv; } #define BOOST_ONCE_INITIAL_FLAG_VALUE 0 #define BOOST_ONCE_INIT {BOOST_ONCE_INITIAL_FLAG_VALUE} // Based on Mike Burrows fast_pthread_once algorithm as described in // http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2444.html template void call_once(once_flag& flag,Function f) { static boost::uintmax_t const uninitialized_flag=BOOST_ONCE_INITIAL_FLAG_VALUE; static boost::uintmax_t const being_initialized=uninitialized_flag+1; boost::uintmax_t const epoch=flag.epoch; boost::uintmax_t& this_thread_epoch=detail::get_once_per_thread_epoch(); if(epoch #endif I right understand, boost don't use atomic operation, so code from boost not thread-safe?

Read the article
Parallelism in Python

- by fmark

What are the options for achieving parallelism in Python? I want to perform a bunch of CPU bound calculations over some very large rasters, and would like to parallelise them. Coming from a C background, I am familiar with three approaches to parallelism: Message passing processes, possibly distributed across a cluster, e.g. MPI. Explicit shared memory parallelism, either using pthreads or fork(), pipe(), et. al Implicit shared memory parallelism, using OpenMP. Deciding on an approach to use is an exercise in trade-offs. In Python, what approaches are available and what are their characteristics? Is there a clusterable MPI clone? What are the preferred ways of achieving shared memory parallelism? I have heard reference to problems with the GIL, as well as references to tasklets. In short, what do I need to know about the different parallelization strategies in Python before choosing between them?

Read the article
How does Batcher Merge work at a high level?

- by Mike

I'm trying to grasp the concept of a Batcher Sort. However, most resources I've found online focus on proof entirely or on low-level pseudocode. Before I look at proofs, I'd like to understand how Batcher Sort works. Can someone give a high level overview of how Batcher Sort works(particularly the merge) without overly verbose pseudocode(I want to get the idea behind the Batcher Sort, not implement it)? Thanks!

Read the article
Python: Plot some data (matplotlib) without GIL

- by BandGap

Hello all, my problem is the GIL of course. While I'm analysing data it would be nice to present some plots in between (so it's not too boring waiting for results) But the GIL prevents this (and this is bringing me to the point of asking myself if Python was such a good idea in the first place). I can only display the plot, wait till the user closes it and commence calculations after that. A waste of time obviously. I already tried the subprocess and multiprocessing modules but can't seem to get them to work. Any thoughts on this one? Thanks

Read the article
programmatically controlling power sockets in the UK

- by cartoonfox

It's very simple. I want to plug a lamp into the UK mains supply. I want to be able to power it on and off from software - say from serial port commands, or by running a command-line or something I can get to from ruby or Java. I see lots written about how to do this with X10 with American power systems - but has anybody actually tried doing this in the UK? If you got this working: 1) Exactly what hardware did you use? 2) How do you control it from software? Thanks!

Read the article
parallelizing code using openmp

- by anubhav

Hi, The function below contains nested for loops. There are 3 of them. I have given the whole function below for easy understanding. I want to parallelize the code in the innermost for loop as it takes maximum CPU time. Then i can think about outer 2 for loops. I can see dependencies and internal inline functions in the innermost for loop . Can the innermost for loop be rewritten to enable parallelization using openmp pragmas. Please tell how. I am writing just the loop which i am interested in first and then the full function where this loop exists for referance. Interested in parallelizing the loop mentioned below. //* LOOP WHICH I WANT TO PARALLELIZE *// for (y = 0; y < 4; y++) { refptr = PelYline_11 (ref_pic, abs_y++, abs_x, img_height, img_width); LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; } The full function where this loop exists is below for referance. /*! *********************************************************************** * \brief * Setup the fast search for an macroblock *********************************************************************** */ void SetupFastFullPelSearch (short ref, int list) // <-- reference frame parameter, list0 or 1 { short pmv[2]; pel_t orig_blocks[256], *orgptr=orig_blocks, *refptr, *tem; // created pointer tem int offset_x, offset_y, x, y, range_partly_outside, ref_x, ref_y, pos, abs_x, abs_y, bindex, blky; int LineSadBlk0, LineSadBlk1, LineSadBlk2, LineSadBlk3; int max_width, max_height; int img_width, img_height; StorablePicture *ref_picture; pel_t *ref_pic; int** block_sad = BlockSAD[list][ref][7]; int search_range = max_search_range[list][ref]; int max_pos = (2*search_range+1) * (2*search_range+1); int list_offset = ((img->MbaffFrameFlag)&&(img->mb_data[img->current_mb_nr].mb_field))? img->current_mb_nr%2 ? 4 : 2 : 0; int apply_weights = ( (active_pps->weighted_pred_flag && (img->type == P_SLICE || img->type == SP_SLICE)) || (active_pps->weighted_bipred_idc && (img->type == B_SLICE))); ref_picture = listX[list+list_offset][ref]; //===== Use weighted Reference for ME ==== if (apply_weights && input->UseWeightedReferenceME) ref_pic = ref_picture->imgY_11_w; else ref_pic = ref_picture->imgY_11; max_width = ref_picture->size_x - 17; max_height = ref_picture->size_y - 17; img_width = ref_picture->size_x; img_height = ref_picture->size_y; //===== get search center: predictor of 16x16 block ===== SetMotionVectorPredictor (pmv, enc_picture->ref_idx, enc_picture->mv, ref, list, 0, 0, 16, 16); search_center_x[list][ref] = pmv[0] / 4; search_center_y[list][ref] = pmv[1] / 4; if (!input->rdopt) { //--- correct center so that (0,0) vector is inside --- search_center_x[list][ref] = max(-search_range, min(search_range, search_center_x[list][ref])); search_center_y[list][ref] = max(-search_range, min(search_range, search_center_y[list][ref])); } search_center_x[list][ref] += img->opix_x; search_center_y[list][ref] += img->opix_y; offset_x = search_center_x[list][ref]; offset_y = search_center_y[list][ref]; //===== copy original block for fast access ===== for (y = img->opix_y; y < img->opix_y+16; y++) for (x = img->opix_x; x < img->opix_x+16; x++) *orgptr++ = imgY_org [y][x]; //===== check if whole search range is inside image ===== if (offset_x >= search_range && offset_x <= max_width - search_range && offset_y >= search_range && offset_y <= max_height - search_range ) { range_partly_outside = 0; PelYline_11 = FastLine16Y_11; } else { range_partly_outside = 1; } //===== determine position of (0,0)-vector ===== if (!input->rdopt) { ref_x = img->opix_x - offset_x; ref_y = img->opix_y - offset_y; for (pos = 0; pos < max_pos; pos++) { if (ref_x == spiral_search_x[pos] && ref_y == spiral_search_y[pos]) { pos_00[list][ref] = pos; break; } } } //===== loop over search range (spiral search): get blockwise SAD ===== **// =====THIS IS THE PART WHERE NESTED FOR STARTS=====** for (pos = 0; pos < max_pos; pos++) // OUTERMOST FOR LOOP { abs_y = offset_y + spiral_search_y[pos]; abs_x = offset_x + spiral_search_x[pos]; if (range_partly_outside) { if (abs_y >= 0 && abs_y <= max_height && abs_x >= 0 && abs_x <= max_width ) { PelYline_11 = FastLine16Y_11; } else { PelYline_11 = UMVLine16Y_11; } } orgptr = orig_blocks; bindex = 0; for (blky = 0; blky < 4; blky++) // SECOND FOR LOOP { LineSadBlk0 = LineSadBlk1 = LineSadBlk2 = LineSadBlk3 = 0; for (y = 0; y < 4; y++) //INNERMOST FOR LOOP WHICH I WANT TO PARALLELIZE { refptr = PelYline_11 (ref_pic, abs_y++, abs_x, img_height, img_width); LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; } block_sad[bindex++][pos] = LineSadBlk0; block_sad[bindex++][pos] = LineSadBlk1; block_sad[bindex++][pos] = LineSadBlk2; block_sad[bindex++][pos] = LineSadBlk3; } } //===== combine SAD's for larger block types ===== SetupLargerBlocks (list, ref, max_pos); //===== set flag marking that search setup have been done ===== search_setup_done[list][ref] = 1; } #endif // _FAST_FULL_ME_

Read the article
How to Profile R Code that Includes SNOW Cluster

- by James

Hi, I have a nested loop that I'm using foreach, DoSNOW, and a SNOW socket cluster to solve for. How should I go about profiling the code to make sure I'm not doing something grossly inefficient. Also is there anyway to measure the data flows going between the master and nodes in a Snow cluster? Thanks, James

Read the article
How to generate makefile targets from variables?

- by Ketil

I currently have a makefile to process some data. The makefile gets the inputs to the data processing by sourcing a CONFIG file, which defines the input data in a variable. Currently, I symlink the input files to a local directory, i.e. the makefile contains: tmp/%.txt: tmp ln -fs $(shell echo $(INPUTS) | tr ' ' '\n' | grep $(patsubst tmp/%,%,$@)) $@ This is not terribly elegant, but appears to work. Is there a better way? Basically, given INPUTS = /foo/bar.txt /zot/snarf.txt I would like to be able to have e.g. %.out: %.txt some command As well as targets to merge results depending on all $(INPUT) files. Also, apart from the kludgosity, the makefile doesn't work correctly with -j, something that is crucial for the analysis to complete in reasonable time. I guess that's a bug in GNU make, but any hints welcome.

Read the article
Is there an existing solution to the multithreaded data structure problem?

- by thr

I've had the need for a multi-threaded data structure that supports these claims: Allows multiple concurrent readers and writers Is sorted Is easy to reason about Fulfilling multiple readers and one writer is a lot easier, but I really would wan't to allow multiple writers. I've been doing research into this area, and I'm aware of ConcurrentSkipList (by Lea based on work by Fraser and Harris) as it's implemented in Java SE 6. I've also implemented my own version of a concurrent Skip List based on A Provably Correct Scalable Concurrent Skip List by Herlihy, Lev, Luchangco and Shavit. These two implementations are developed by people that are light years smarter then me, but I still (somewhat ashamed, because it is amazing work) have to ask the question if these are the two only viable implementations of a concurrent multi reader/writer data structures available today?

Read the article
How do I retrieve the number of processors on C / Linux ?

- by Andrei Ciobanu

I am writing a small C application that use some threads for processing data. I want to be able to know the number of processors on a certain machine, without using system() & in combination to a small script. The only way i can think of is to parse /proc/cpuinfo. Any other useful suggestions ?

Read the article
LinqToSql - Parralel - DataContext and Parralel

- by Gregoire

In .NET 4 and multicore environment, does the linq to sql datacontext object take advantage of the new parallels if we use DataLoadOptions.LoadWith?

Read the article

< Previous Page | 14 15 16 17 18 19 20 21 22 23 24 25 | Next Page >