Search Results

Search found 25785 results on 1032 pages for 'oracles solution for soc'.

Page 151/1032 | < Previous Page | 147 148 149 150 151 152 153 154 155 156 157 158  | Next Page >

  • How John Got 15x Improvement Without Really Trying

    - by rchrd
    The following article was published on a Sun Microsystems website a number of years ago by John Feo. It is still useful and worth preserving. So I'm republishing it here.  How I Got 15x Improvement Without Really Trying John Feo, Sun Microsystems Taking ten "personal" program codes used in scientific and engineering research, the author was able to get from 2 to 15 times performance improvement easily by applying some simple general optimization techniques. Introduction Scientific research based on computer simulation depends on the simulation for advancement. The research can advance only as fast as the computational codes can execute. The codes' efficiency determines both the rate and quality of results. In the same amount of time, a faster program can generate more results and can carry out a more detailed simulation of physical phenomena than a slower program. Highly optimized programs help science advance quickly and insure that monies supporting scientific research are used as effectively as possible. Scientific computer codes divide into three broad categories: ISV, community, and personal. ISV codes are large, mature production codes developed and sold commercially. The codes improve slowly over time both in methods and capabilities, and they are well tuned for most vendor platforms. Since the codes are mature and complex, there are few opportunities to improve their performance solely through code optimization. Improvements of 10% to 15% are typical. Examples of ISV codes are DYNA3D, Gaussian, and Nastran. Community codes are non-commercial production codes used by a particular research field. Generally, they are developed and distributed by a single academic or research institution with assistance from the community. Most users just run the codes, but some develop new methods and extensions that feed back into the general release. The codes are available on most vendor platforms. Since these codes are younger than ISV codes, there are more opportunities to optimize the source code. Improvements of 50% are not unusual. Examples of community codes are AMBER, CHARM, BLAST, and FASTA. Personal codes are those written by single users or small research groups for their own use. These codes are not distributed, but may be passed from professor-to-student or student-to-student over several years. They form the primordial ocean of applications from which community and ISV codes emerge. Government research grants pay for the development of most personal codes. This paper reports on the nature and performance of this class of codes. Over the last year, I have looked at over two dozen personal codes from more than a dozen research institutions. The codes cover a variety of scientific fields, including astronomy, atmospheric sciences, bioinformatics, biology, chemistry, geology, and physics. The sources range from a few hundred lines to more than ten thousand lines, and are written in Fortran, Fortran 90, C, and C++. For the most part, the codes are modular, documented, and written in a clear, straightforward manner. They do not use complex language features, advanced data structures, programming tricks, or libraries. I had little trouble understanding what the codes did or how data structures were used. Most came with a makefile. Surprisingly, only one of the applications is parallel. All developers have access to parallel machines, so availability is not an issue. Several tried to parallelize their applications, but stopped after encountering difficulties. Lack of education and a perception that parallelism is difficult prevented most from trying. I parallelized several of the codes using OpenMP, and did not judge any of the codes as difficult to parallelize. Even more surprising than the lack of parallelism is the inefficiency of the codes. I was able to get large improvements in performance in a matter of a few days applying simple optimization techniques. Table 1 lists ten representative codes [names and affiliation are omitted to preserve anonymity]. Improvements on one processor range from 2x to 15.5x with a simple average of 4.75x. I did not use sophisticated performance tools or drill deep into the program's execution character as one would do when tuning ISV or community codes. Using only a profiler and source line timers, I identified inefficient sections of code and improved their performance by inspection. The changes were at a high level. I am sure there is another factor of 2 or 3 in each code, and more if the codes are parallelized. The study’s results show that personal scientific codes are running many times slower than they should and that the problem is pervasive. Computational scientists are not sloppy programmers; however, few are trained in the art of computer programming or code optimization. I found that most have a working knowledge of some programming language and standard software engineering practices; but they do not know, or think about, how to make their programs run faster. They simply do not know the standard techniques used to make codes run faster. In fact, they do not even perceive that such techniques exist. The case studies described in this paper show that applying simple, well known techniques can significantly increase the performance of personal codes. It is important that the scientific community and the Government agencies that support scientific research find ways to better educate academic scientific programmers. The inefficiency of their codes is so bad that it is retarding both the quality and progress of scientific research. # cacheperformance redundantoperations loopstructures performanceimprovement 1 x x 15.5 2 x 2.8 3 x x 2.5 4 x 2.1 5 x x 2.0 6 x 5.0 7 x 5.8 8 x 6.3 9 2.2 10 x x 3.3 Table 1 — Area of improvement and performance gains of 10 codes The remainder of the paper is organized as follows: sections 2, 3, and 4 discuss the three most common sources of inefficiencies in the codes studied. These are cache performance, redundant operations, and loop structures. Each section includes several examples. The last section summaries the work and suggests a possible solution to the issues raised. Optimizing cache performance Commodity microprocessor systems use caches to increase memory bandwidth and reduce memory latencies. Typical latencies from processor to L1, L2, local, and remote memory are 3, 10, 50, and 200 cycles, respectively. Moreover, bandwidth falls off dramatically as memory distances increase. Programs that do not use cache effectively run many times slower than programs that do. When optimizing for cache, the biggest performance gains are achieved by accessing data in cache order and reusing data to amortize the overhead of cache misses. Secondary considerations are prefetching, associativity, and replacement; however, the understanding and analysis required to optimize for the latter are probably beyond the capabilities of the non-expert. Much can be gained simply by accessing data in the correct order and maximizing data reuse. 6 out of the 10 codes studied here benefited from such high level optimizations. Array Accesses The most important cache optimization is the most basic: accessing Fortran array elements in column order and C array elements in row order. Four of the ten codes—1, 2, 4, and 10—got it wrong. Compilers will restructure nested loops to optimize cache performance, but may not do so if the loop structure is too complex, or the loop body includes conditionals, complex addressing, or function calls. In code 1, the compiler failed to invert a key loop because of complex addressing do I = 0, 1010, delta_x IM = I - delta_x IP = I + delta_x do J = 5, 995, delta_x JM = J - delta_x JP = J + delta_x T1 = CA1(IP, J) + CA1(I, JP) T2 = CA1(IM, J) + CA1(I, JM) S1 = T1 + T2 - 4 * CA1(I, J) CA(I, J) = CA1(I, J) + D * S1 end do end do In code 2, the culprit is conditionals do I = 1, N do J = 1, N If (IFLAG(I,J) .EQ. 0) then T1 = Value(I, J-1) T2 = Value(I-1, J) T3 = Value(I, J) T4 = Value(I+1, J) T5 = Value(I, J+1) Value(I,J) = 0.25 * (T1 + T2 + T5 + T4) Delta = ABS(T3 - Value(I,J)) If (Delta .GT. MaxDelta) MaxDelta = Delta endif enddo enddo I fixed both programs by inverting the loops by hand. Code 10 has three-dimensional arrays and triply nested loops. The structure of the most computationally intensive loops is too complex to invert automatically or by hand. The only practical solution is to transpose the arrays so that the dimension accessed by the innermost loop is in cache order. The arrays can be transposed at construction or prior to entering a computationally intensive section of code. The former requires all array references to be modified, while the latter is cost effective only if the cost of the transpose is amortized over many accesses. I used the second approach to optimize code 10. Code 5 has four-dimensional arrays and loops are nested four deep. For all of the reasons cited above the compiler is not able to restructure three key loops. Assume C arrays and let the four dimensions of the arrays be i, j, k, and l. In the original code, the index structure of the three loops is L1: for i L2: for i L3: for i for l for l for j for k for j for k for j for k for l So only L3 accesses array elements in cache order. L1 is a very complex loop—much too complex to invert. I brought the loop into cache alignment by transposing the second and fourth dimensions of the arrays. Since the code uses a macro to compute all array indexes, I effected the transpose at construction and changed the macro appropriately. The dimensions of the new arrays are now: i, l, k, and j. L3 is a simple loop and easily inverted. L2 has a loop-carried scalar dependence in k. By promoting the scalar name that carries the dependence to an array, I was able to invert the third and fourth subloops aligning the loop with cache. Code 5 is by far the most difficult of the four codes to optimize for array accesses; but the knowledge required to fix the problems is no more than that required for the other codes. I would judge this code at the limits of, but not beyond, the capabilities of appropriately trained computational scientists. Array Strides When a cache miss occurs, a line (64 bytes) rather than just one word is loaded into the cache. If data is accessed stride 1, than the cost of the miss is amortized over 8 words. Any stride other than one reduces the cost savings. Two of the ten codes studied suffered from non-unit strides. The codes represent two important classes of "strided" codes. Code 1 employs a multi-grid algorithm to reduce time to convergence. The grids are every tenth, fifth, second, and unit element. Since time to convergence is inversely proportional to the distance between elements, coarse grids converge quickly providing good starting values for finer grids. The better starting values further reduce the time to convergence. The downside is that grids of every nth element, n > 1, introduce non-unit strides into the computation. In the original code, much of the savings of the multi-grid algorithm were lost due to this problem. I eliminated the problem by compressing (copying) coarse grids into continuous memory, and rewriting the computation as a function of the compressed grid. On convergence, I copied the final values of the compressed grid back to the original grid. The savings gained from unit stride access of the compressed grid more than paid for the cost of copying. Using compressed grids, the loop from code 1 included in the previous section becomes do j = 1, GZ do i = 1, GZ T1 = CA(i+0, j-1) + CA(i-1, j+0) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) S1 = T1 + T4 - 4 * CA1(i+0, j+0) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 enddo enddo where CA and CA1 are compressed arrays of size GZ. Code 7 traverses a list of objects selecting objects for later processing. The labels of the selected objects are stored in an array. The selection step has unit stride, but the processing steps have irregular stride. A fix is to save the parameters of the selected objects in temporary arrays as they are selected, and pass the temporary arrays to the processing functions. The fix is practical if the same parameters are used in selection as in processing, or if processing comprises a series of distinct steps which use overlapping subsets of the parameters. Both conditions are true for code 7, so I achieved significant improvement by copying parameters to temporary arrays during selection. Data reuse In the previous sections, we optimized for spatial locality. It is also important to optimize for temporal locality. Once read, a datum should be used as much as possible before it is forced from cache. Loop fusion and loop unrolling are two techniques that increase temporal locality. Unfortunately, both techniques increase register pressure—as loop bodies become larger, the number of registers required to hold temporary values grows. Once register spilling occurs, any gains evaporate quickly. For multiprocessors with small register sets or small caches, the sweet spot can be very small. In the ten codes presented here, I found no opportunities for loop fusion and only two opportunities for loop unrolling (codes 1 and 3). In code 1, unrolling the outer and inner loop one iteration increases the number of result values computed by the loop body from 1 to 4, do J = 1, GZ-2, 2 do I = 1, GZ-2, 2 T1 = CA1(i+0, j-1) + CA1(i-1, j+0) T2 = CA1(i+1, j-1) + CA1(i+0, j+0) T3 = CA1(i+0, j+0) + CA1(i-1, j+1) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) T5 = CA1(i+2, j+0) + CA1(i+1, j+1) T6 = CA1(i+1, j+1) + CA1(i+0, j+2) T7 = CA1(i+2, j+1) + CA1(i+1, j+2) S1 = T1 + T4 - 4 * CA1(i+0, j+0) S2 = T2 + T5 - 4 * CA1(i+1, j+0) S3 = T3 + T6 - 4 * CA1(i+0, j+1) S4 = T4 + T7 - 4 * CA1(i+1, j+1) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 CA(i+1, j+0) = CA1(i+1, j+0) + DD * S2 CA(i+0, j+1) = CA1(i+0, j+1) + DD * S3 CA(i+1, j+1) = CA1(i+1, j+1) + DD * S4 enddo enddo The loop body executes 12 reads, whereas as the rolled loop shown in the previous section executes 20 reads to compute the same four values. In code 3, two loops are unrolled 8 times and one loop is unrolled 4 times. Here is the before for (k = 0; k < NK[u]; k++) { sum = 0.0; for (y = 0; y < NY; y++) { sum += W[y][u][k] * delta[y]; } backprop[i++]=sum; } and after code for (k = 0; k < KK - 8; k+=8) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (y = 0; y < NY; y++) { sum0 += W[y][0][k+0] * delta[y]; sum1 += W[y][0][k+1] * delta[y]; sum2 += W[y][0][k+2] * delta[y]; sum3 += W[y][0][k+3] * delta[y]; sum4 += W[y][0][k+4] * delta[y]; sum5 += W[y][0][k+5] * delta[y]; sum6 += W[y][0][k+6] * delta[y]; sum7 += W[y][0][k+7] * delta[y]; } backprop[k+0] = sum0; backprop[k+1] = sum1; backprop[k+2] = sum2; backprop[k+3] = sum3; backprop[k+4] = sum4; backprop[k+5] = sum5; backprop[k+6] = sum6; backprop[k+7] = sum7; } for one of the loops unrolled 8 times. Optimizing for temporal locality is the most difficult optimization considered in this paper. The concepts are not difficult, but the sweet spot is small. Identifying where the program can benefit from loop unrolling or loop fusion is not trivial. Moreover, it takes some effort to get it right. Still, educating scientific programmers about temporal locality and teaching them how to optimize for it will pay dividends. Reducing instruction count Execution time is a function of instruction count. Reduce the count and you usually reduce the time. The best solution is to use a more efficient algorithm; that is, an algorithm whose order of complexity is smaller, that converges quicker, or is more accurate. Optimizing source code without changing the algorithm yields smaller, but still significant, gains. This paper considers only the latter because the intent is to study how much better codes can run if written by programmers schooled in basic code optimization techniques. The ten codes studied benefited from three types of "instruction reducing" optimizations. The two most prevalent were hoisting invariant memory and data operations out of inner loops. The third was eliminating unnecessary data copying. The nature of these inefficiencies is language dependent. Memory operations The semantics of C make it difficult for the compiler to determine all the invariant memory operations in a loop. The problem is particularly acute for loops in functions since the compiler may not know the values of the function's parameters at every call site when compiling the function. Most compilers support pragmas to help resolve ambiguities; however, these pragmas are not comprehensive and there is no standard syntax. To guarantee that invariant memory operations are not executed repetitively, the user has little choice but to hoist the operations by hand. The problem is not as severe in Fortran programs because in the absence of equivalence statements, it is a violation of the language's semantics for two names to share memory. Codes 3 and 5 are C programs. In both cases, the compiler did not hoist all invariant memory operations from inner loops. Consider the following loop from code 3 for (y = 0; y < NY; y++) { i = 0; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += delta[y] * I1[i++]; } } } Since dW[y][u] can point to the same memory space as delta for one or more values of y and u, assignment to dW[y][u][k] may change the value of delta[y]. In reality, dW and delta do not overlap in memory, so I rewrote the loop as for (y = 0; y < NY; y++) { i = 0; Dy = delta[y]; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += Dy * I1[i++]; } } } Failure to hoist invariant memory operations may be due to complex address calculations. If the compiler can not determine that the address calculation is invariant, then it can hoist neither the calculation nor the associated memory operations. As noted above, code 5 uses a macro to address four-dimensional arrays #define MAT4D(a,q,i,j,k) (double *)((a)->data + (q)*(a)->strides[0] + (i)*(a)->strides[3] + (j)*(a)->strides[2] + (k)*(a)->strides[1]) The macro is too complex for the compiler to understand and so, it does not identify any subexpressions as loop invariant. The simplest way to eliminate the address calculation from the innermost loop (over i) is to define a0 = MAT4D(a,q,0,j,k) before the loop and then replace all instances of *MAT4D(a,q,i,j,k) in the loop with a0[i] A similar problem appears in code 6, a Fortran program. The key loop in this program is do n1 = 1, nh nx1 = (n1 - 1) / nz + 1 nz1 = n1 - nz * (nx1 - 1) do n2 = 1, nh nx2 = (n2 - 1) / nz + 1 nz2 = n2 - nz * (nx2 - 1) ndx = nx2 - nx1 ndy = nz2 - nz1 gxx = grn(1,ndx,ndy) gyy = grn(2,ndx,ndy) gxy = grn(3,ndx,ndy) balance(n1,1) = balance(n1,1) + (force(n2,1) * gxx + force(n2,2) * gxy) * h1 balance(n1,2) = balance(n1,2) + (force(n2,1) * gxy + force(n2,2) * gyy)*h1 end do end do The programmer has written this loop well—there are no loop invariant operations with respect to n1 and n2. However, the loop resides within an iterative loop over time and the index calculations are independent with respect to time. Trading space for time, I precomputed the index values prior to the entering the time loop and stored the values in two arrays. I then replaced the index calculations with reads of the arrays. Data operations Ways to reduce data operations can appear in many forms. Implementing a more efficient algorithm produces the biggest gains. The closest I came to an algorithm change was in code 4. This code computes the inner product of K-vectors A(i) and B(j), 0 = i < N, 0 = j < M, for most values of i and j. Since the program computes most of the NM possible inner products, it is more efficient to compute all the inner products in one triply-nested loop rather than one at a time when needed. The savings accrue from reading A(i) once for all B(j) vectors and from loop unrolling. for (i = 0; i < N; i+=8) { for (j = 0; j < M; j++) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (k = 0; k < K; k++) { sum0 += A[i+0][k] * B[j][k]; sum1 += A[i+1][k] * B[j][k]; sum2 += A[i+2][k] * B[j][k]; sum3 += A[i+3][k] * B[j][k]; sum4 += A[i+4][k] * B[j][k]; sum5 += A[i+5][k] * B[j][k]; sum6 += A[i+6][k] * B[j][k]; sum7 += A[i+7][k] * B[j][k]; } C[i+0][j] = sum0; C[i+1][j] = sum1; C[i+2][j] = sum2; C[i+3][j] = sum3; C[i+4][j] = sum4; C[i+5][j] = sum5; C[i+6][j] = sum6; C[i+7][j] = sum7; }} This change requires knowledge of a typical run; i.e., that most inner products are computed. The reasons for the change, however, derive from basic optimization concepts. It is the type of change easily made at development time by a knowledgeable programmer. In code 5, we have the data version of the index optimization in code 6. Here a very expensive computation is a function of the loop indices and so cannot be hoisted out of the loop; however, the computation is invariant with respect to an outer iterative loop over time. We can compute its value for each iteration of the computation loop prior to entering the time loop and save the values in an array. The increase in memory required to store the values is small in comparison to the large savings in time. The main loop in Code 8 is doubly nested. The inner loop includes a series of guarded computations; some are a function of the inner loop index but not the outer loop index while others are a function of the outer loop index but not the inner loop index for (j = 0; j < N; j++) { for (i = 0; i < M; i++) { r = i * hrmax; R = A[j]; temp = (PRM[3] == 0.0) ? 1.0 : pow(r, PRM[3]); high = temp * kcoeff * B[j] * PRM[2] * PRM[4]; low = high * PRM[6] * PRM[6] / (1.0 + pow(PRM[4] * PRM[6], 2.0)); kap = (R > PRM[6]) ? high * R * R / (1.0 + pow(PRM[4]*r, 2.0) : low * pow(R/PRM[6], PRM[5]); < rest of loop omitted > }} Note that the value of temp is invariant to j. Thus, we can hoist the computation for temp out of the loop and save its values in an array. for (i = 0; i < M; i++) { r = i * hrmax; TEMP[i] = pow(r, PRM[3]); } [N.B. – the case for PRM[3] = 0 is omitted and will be reintroduced later.] We now hoist out of the inner loop the computations invariant to i. Since the conditional guarding the value of kap is invariant to i, it behooves us to hoist the computation out of the inner loop, thereby executing the guard once rather than M times. The final version of the code is for (j = 0; j < N; j++) { R = rig[j] / 1000.; tmp1 = kcoeff * par[2] * beta[j] * par[4]; tmp2 = 1.0 + (par[4] * par[4] * par[6] * par[6]); tmp3 = 1.0 + (par[4] * par[4] * R * R); tmp4 = par[6] * par[6] / tmp2; tmp5 = R * R / tmp3; tmp6 = pow(R / par[6], par[5]); if ((par[3] == 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp5; } else if ((par[3] == 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp4 * tmp6; } else if ((par[3] != 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp5; } else if ((par[3] != 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp4 * tmp6; } for (i = 0; i < M; i++) { kap = KAP[i]; r = i * hrmax; < rest of loop omitted > } } Maybe not the prettiest piece of code, but certainly much more efficient than the original loop, Copy operations Several programs unnecessarily copy data from one data structure to another. This problem occurs in both Fortran and C programs, although it manifests itself differently in the two languages. Code 1 declares two arrays—one for old values and one for new values. At the end of each iteration, the array of new values is copied to the array of old values to reset the data structures for the next iteration. This problem occurs in Fortran programs not included in this study and in both Fortran 77 and Fortran 90 code. Introducing pointers to the arrays and swapping pointer values is an obvious way to eliminate the copying; but pointers is not a feature that many Fortran programmers know well or are comfortable using. An easy solution not involving pointers is to extend the dimension of the value array by 1 and use the last dimension to differentiate between arrays at different times. For example, if the data space is N x N, declare the array (N, N, 2). Then store the problem’s initial values in (_, _, 2) and define the scalar names new = 2 and old = 1. At the start of each iteration, swap old and new to reset the arrays. The old–new copy problem did not appear in any C program. In programs that had new and old values, the code swapped pointers to reset data structures. Where unnecessary coping did occur is in structure assignment and parameter passing. Structures in C are handled much like scalars. Assignment causes the data space of the right-hand name to be copied to the data space of the left-hand name. Similarly, when a structure is passed to a function, the data space of the actual parameter is copied to the data space of the formal parameter. If the structure is large and the assignment or function call is in an inner loop, then copying costs can grow quite large. While none of the ten programs considered here manifested this problem, it did occur in programs not included in the study. A simple fix is always to refer to structures via pointers. Optimizing loop structures Since scientific programs spend almost all their time in loops, efficient loops are the key to good performance. Conditionals, function calls, little instruction level parallelism, and large numbers of temporary values make it difficult for the compiler to generate tightly packed, highly efficient code. Conditionals and function calls introduce jumps that disrupt code flow. Users should eliminate or isolate conditionls to their own loops as much as possible. Often logical expressions can be substituted for if-then-else statements. For example, code 2 includes the following snippet MaxDelta = 0.0 do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) if (Delta > MaxDelta) MaxDelta = Delta enddo enddo if (MaxDelta .gt. 0.001) goto 200 Since the only use of MaxDelta is to control the jump to 200 and all that matters is whether or not it is greater than 0.001, I made MaxDelta a boolean and rewrote the snippet as MaxDelta = .false. do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) MaxDelta = MaxDelta .or. (Delta .gt. 0.001) enddo enddo if (MaxDelta) goto 200 thereby, eliminating the conditional expression from the inner loop. A microprocessor can execute many instructions per instruction cycle. Typically, it can execute one or more memory, floating point, integer, and jump operations. To be executed simultaneously, the operations must be independent. Thick loops tend to have more instruction level parallelism than thin loops. Moreover, they reduce memory traffice by maximizing data reuse. Loop unrolling and loop fusion are two techniques to increase the size of loop bodies. Several of the codes studied benefitted from loop unrolling, but none benefitted from loop fusion. This observation is not too surpising since it is the general tendency of programmers to write thick loops. As loops become thicker, the number of temporary values grows, increasing register pressure. If registers spill, then memory traffic increases and code flow is disrupted. A thick loop with many temporary values may execute slower than an equivalent series of thin loops. The biggest gain will be achieved if the thick loop can be split into a series of independent loops eliminating the need to write and read temporary arrays. I found such an occasion in code 10 where I split the loop do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do into two disjoint loops do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) end do end do do i = 1, n do j = 1, m C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do Conclusions Over the course of the last year, I have had the opportunity to work with over two dozen academic scientific programmers at leading research universities. Their research interests span a broad range of scientific fields. Except for two programs that relied almost exclusively on library routines (matrix multiply and fast Fourier transform), I was able to improve significantly the single processor performance of all codes. Improvements range from 2x to 15.5x with a simple average of 4.75x. Changes to the source code were at a very high level. I did not use sophisticated techniques or programming tools to discover inefficiencies or effect the changes. Only one code was parallel despite the availability of parallel systems to all developers. Clearly, we have a problem—personal scientific research codes are highly inefficient and not running parallel. The developers are unaware of simple optimization techniques to make programs run faster. They lack education in the art of code optimization and parallel programming. I do not believe we can fix the problem by publishing additional books or training manuals. To date, the developers in questions have not studied the books or manual available, and are unlikely to do so in the future. Short courses are a possible solution, but I believe they are too concentrated to be much use. The general concepts can be taught in a three or four day course, but that is not enough time for students to practice what they learn and acquire the experience to apply and extend the concepts to their codes. Practice is the key to becoming proficient at optimization. I recommend that graduate students be required to take a semester length course in optimization and parallel programming. We would never give someone access to state-of-the-art scientific equipment costing hundreds of thousands of dollars without first requiring them to demonstrate that they know how to use the equipment. Yet the criterion for time on state-of-the-art supercomputers is at most an interesting project. Requestors are never asked to demonstrate that they know how to use the system, or can use the system effectively. A semester course would teach them the required skills. Government agencies that fund academic scientific research pay for most of the computer systems supporting scientific research as well as the development of most personal scientific codes. These agencies should require graduate schools to offer a course in optimization and parallel programming as a requirement for funding. About the Author John Feo received his Ph.D. in Computer Science from The University of Texas at Austin in 1986. After graduate school, Dr. Feo worked at Lawrence Livermore National Laboratory where he was the Group Leader of the Computer Research Group and principal investigator of the Sisal Language Project. In 1997, Dr. Feo joined Tera Computer Company where he was project manager for the MTA, and oversaw the programming and evaluation of the MTA at the San Diego Supercomputer Center. In 2000, Dr. Feo joined Sun Microsystems as an HPC application specialist. He works with university research groups to optimize and parallelize scientific codes. Dr. Feo has published over two dozen research articles in the areas of parallel parallel programming, parallel programming languages, and application performance.

    Read the article

  • Authorizing a computer to access a web application

    - by HackedByChinese
    I have a web application, and am tasked with adding secure sign-on to bolster security, akin to what Google has added to Google accounts. Use Case Essentially, when a user logs in, we want to detect if the user has previously authorized this computer. If the computer has not been authorized, the user is sent a one-time password (via email, SMS, or phone call) that they must enter, where the user may choose to remember this computer. In the web application, we will track authorized devices, allowing users to see when/where they logged in from that device last, and deauthorize any devices if they so choose. We require a solution that is very light touch (meaning, requiring no client-side software installation), and works with Safari, Chrome, Firefox, and IE 7+ (unfortunately). We will offer x509 security, which provides adequate security, but we still need a solution for customers that can't or won't use x509. My intention is to store authorization information using cookies (or, potentially, using local storage, degrading to flash cookies, and then normal cookies). At First Blush Track two separate values (local data or cookies): a hash representing a secure sign-on token, as well as a device token. Both values are driven (and recorded) by the web application, and dictated to the client. The SSO token is dependent on the device as well as a sequence number. This effectively allows devices to be deauthorized (all SSO tokens become invalid) and mitigates replay (not effectively, though, which is why I'm asking this question) through the use of a sequence number, and uses a nonce. Problem With this solution, it's possible for someone to just copy the SSO and device tokens and use in another request. While the sequence number will help me detect such an abuse and thus deauthorize the device, the detection and response can only happen after the valid device and malicious request both attempt access, which is ample time for damage to be done. I feel like using HMAC would be better. Track the device, the sequence, create a nonce, timestamp, and hash with a private key, then send the hash plus those values as plain text. Server does the same (in addition to validating the device and sequence) and compares. That seems much easier, and much more reliable.... assuming we can securely negotiate, exchange, and store private keys. Question So then, how can I securely negotiate a private key for authorized device, and then securely store that key? Is it more possible, at least, if I settle for storing the private key using local storage or flash cookies and just say it's "good enough"? Or, is there something I can do to my original draft to mitigate the vulnerability I describe?

    Read the article

  • Outputting a bar chart to an iPhone application?

    - by Moddy
    Right, I really want to output a Bar Graph to an Obj-C iPhone application - now I may be missing a vital SDK class or something; but right now my solution is to embed a WebView and inside that have a JQuery/Flot based graph - not totally ideal I know! Just wondering if anybody has any other creative solutions or whether a WebView/AJAX solution is the way to go? (For the record; my data source will be figures returned from an external source - i.e downloaded from the internet. So I was even toying with having a PHP Proxy/script do all the work on that, return the figures AND a graph to the application - but then I risk placing extra strain on the server!)

    Read the article

  • How to get default Ctrl+Tab functionality in WinForms MDI app when hosting WPF UserControls

    - by jpierson
    I have a WinForms based app with traditional MDI implementation within it except that I'm hosting WPF based UserControls via the ElementHost control as the main content for each of my MDI children. This is the solution recommended by Microsoft for achieving MDI with WPF although there are various side effects unfortunately. One of which is that my Ctrl+Tab functionality for tab switching between each MDI child is gone because the tab key seems to be swallowed up by the WPF controls. Is there a simple solution to this that will let the Ctrl+tab key sequences reach my WinForms MDI parent so that I can get the built-in tab switching functionality?

    Read the article

  • Trying to read FormsAuthentication tickets to read in other areas of site

    - by Pasha Aryana
    Hi, NOTE: I have included 3 links in here to my localhost areas but could not submit the post so I seperetaed them with a space character so it would post on stackoverflow. I currently have 2 ASP.NET MVC apps in my solution. First I run the first one by setting it to be startup project. It goes to the login page, from there once the data has been entered I execute the following code: var authTicket = new FormsAuthenticationTicket(1, login.LoginDataContract.MSISDN, DateTime.Now, DateTime.Now.AddMinutes(Convert.ToDouble("30")), true, ""); string cookieContents = FormsAuthentication.Encrypt(authTicket); var cookie = new HttpCookie(FormsAuthentication.FormsCookieName, cookieContents) { Expires = authTicket.Expiration, //Path = FormsAuthentication.FormsCookiePath //Path = "http://localhost" Domain = "" }; if (System.Web.HttpContext.Current != null) { System.Web.HttpContext.Current.Response.Cookies.Add(cookie); } As you can see I have set the Domain = "", so theoretically speaking it should work on any thing under my http: //localhost. Then I have set the persist security of the cookie to true so I can access it from any where under localhost. The cookie writes fine and I get logged in and all godd for now. BTW the url for this login page is: http //localhost/MyAccount/Login Now then I stop the solution and set the other MVC apps to be the startup. Then I run it. The URL for the second site is: http: //localhost/WebActivations/ Here is the code in the other apps start controller: public class HomeController : Controller { public ActionResult Index() { ViewData["Message"] = "Welcome to ASP.NET MVC!"; // PASHA: Added code to read the authorization cookie set at // login in MyAccount *.sln for (int i = 0; i < System.Web.HttpContext.Current.Request.Cookies.Count;i++) { Response.Write(System.Web.HttpContext.Current.Request.Cookies[i].Name + " " + System.Web.HttpContext.Current.Request.Cookies[i].Value); } HttpCookie authorizationCookie = System.Web.HttpContext.Current.Request.Cookies[FormsAuthentication.FormsCookieName.ToString()]; // decrypt. FormsAuthenticationTicket authorizationForm = FormsAuthentication.Decrypt(authorizationCookie.Value); ViewData["Message"] = authorizationForm.UserData[0].ToString(); return View(); } public ActionResult About() { return View(); } The problem is in this Home controller when I run the solution it cannot read the authentication cookie, you see the loop there it does not find the .ASPXAUTH cookie. But once it crashes in Firefox I have a look in the Page Info and then security and Cookies and its there and its the same cookie. What am I doing wrong?

    Read the article

  • How do I debug a DLL from VS2008?

    - by GregH
    I have a program written in VB.Net (Visual Studio 2008) that uses a DLL written in Visual C++ by another developer. I'd like to be able to step in to the C++ code as my code makes calls to methods in the DLL. Since the DLL is it's own solution, I don't think it can be included in my solution/project. I tried putting the DLLs pdb file in the debug/bin directory with the rest of my build and pdb files. However, when I get to the point in stepping through my code, and it gets to the dll call, it just steps right over the dll code. Do I have to manually load symbols? Not sure what I'm doing wrong. Thanks.

    Read the article

  • "Go To Definition" in Visual Studio only brings up the Metadata

    - by pfunk
    I am working in a Web Project in Visual Studio 2008. When I hit F12 (or right-click and select Go To Definition) Visual Studio is consistently going to the Meta data file instead of going to the source. Some Points: All the source code is C#, there is no VB.Net All the projects are in the same solution Yes, everything is a project reference (checked and double-checked) I have tried the Clean/Rebuild Solution approach (even to the point of clearing out the Temp directory, Temporary ASP.NET Files directory, etc). Has anyone else seen this behavior and/or know how to fix it?

    Read the article

  • Multi-threading mechanisms to run some lengthy operations from winforms code and communication with

    - by tmarouda
    What do I want to achieve: I want to perform some time consuming operations from my MDI winforms application (C# - .NET). An MDI child form may create the thread with the operation, which may take long time (from 0.1 seconds, to even half hour) to complete. In the meantime I want the UI to respond to user actions, including manipulation of data in some other MDI child form. When the operation completes, the thread should notify the MDI child that the calculations are done, so that the MDI child can perform the post-processing. How can I achieve this: Should I use explicit threading (i.e., create explicit threads), thread pools? Or simply just propose your solution. Should I create foreground or background threads? And how does the thread communicates with the GUI, according the solution you propose? If you know of a working example that handles a similar situation, please make a note.

    Read the article

  • Controllling Drupal's active/active-trail with duplicate menu items

    - by Mark
    I'm developing a site that requires some duplication of links within the menu: Section A -- Introduction -- Testimonials Section B -- Introduction -- Testimonials Testimonials -- Section A -- Section B So 'Section A Testimonials' and 'Testimonials Section A' point to the same node. But regardless of which menu link people use, I want the person to be in Section A. The problem is that D6 doesn't like duplicate menu items, and it assigns the active and active-trail classes rather unpredictably. So my thought was to create a placeholder node for each item in the Testimonials menu, and then set the URL to something like "testimonials/redirect/section-a", and then use mod_rewrite to redirect over to "section-a/testimonials". With this solution, I will have no duplicate paths in the menu. I'm just hoping this doesn't somehow hurt my SEO. Does anyone know a better solution?

    Read the article

  • Can you do icon overlays using Java on Windows OS

    - by Saviz
    I would like to manupilate badges or icon overlays using Java on windows. Basically some files on the drive to have overlays depending what state those files are in. This should be visible through windows explorer. Something simillar to how DropBox does things. Is that possible? I've seen several articles on this topic but none of them use Java. They all seem to use C++ or C# or COM objects. I was looking for a Java solution for windows. Of course I'd like to have a Java solution on Mac's too. Not sure if this is possible but before I give up I thought I ask.

    Read the article

  • HTML table headers always visible at top of window when viewing a large table

    - by Craig McQueen
    I would like to be able to "tweak" an HTML table's presentation to add a single feature: when scrolling down through the page so that the table is on the screen but the header rows are off-screen, I would like the headers to remain visible at the top of the viewing area. This would be conceptually like the "freeze panes" feature in Excel. However, an HTML page might contain several tables in it and I only would want it to happen for the table that is currently in-view, only while it is in-view. Note: I've seen one solution where the table data area is made scrollable while the headers do not scroll. That's not the solution I'm looking for.

    Read the article

  • Will OpenGL give me any FPS improvement over CoreAnimation for scrolling a large image?

    - by Ben Roberts
    Hi, I'm considering re-writing the menu system of my iPhone app to use Open GL just to improve the smoothness of scrolling a big image (480x1900px) across the screen. I'm looking at doing this as a way to improve on using the method/solution as described here (http://stackoverflow.com/questions/1443140/smoother-uiview). This solution was a big improvement over the previous implementation but it's still not perfect and as this is the first thing the user will see I'd like it to be as flawless as possible. Will switching to OpenGL give me the sort of smooth scrolling I'm looking for? I've stayed clear of OpenGL until now as this is my first app and core animation has handled everything else I've thrown at it well enough, would be good to know if this alternative implementation is likely to work! thanks

    Read the article

  • Multiple client projects to one server project w/ Silverlight & RIA Services Beta

    - by Dale Halliwell
    The type or namespace name 'Resources' does not exist in the namespace 'MyWebProject.Web' (are you missing an assembly reference?) C:\Users\...\MySecondProject\Generated_Code\MyWebProject.Web.g.cs I am having some problems trying to add a second SL client project to my (Ria services) SL Business Application. It has to do with the way the shared Resources files on the Web project are linked to from my new SL client project (the SL client project that was generated by the Business App template works fine). The same problem was brought up in the SL forums but copying the Web folder from my existing SL client doesn't seem to work. How can I add a second SL client project using RIA services to the solution of an existing SL Business Application without these problems over shared resources? Should I avoid the Business Application solution template for solutions with multiple SL clients since it seems to presume only a single client app will be sharing the resource files?

    Read the article

  • XML to be validated against multiple xsd schemas

    - by Michael Rusch
    I'm writing the xsd and the code to validate, so I have great control here. I would like to have an upload facility that adds stuff to my application based on an xml file. One part of the xml file should be validated against different schemas based on one of the values in the other part of it. Here's an example to illustrate: <foo> <name>Harold</name> <bar>Alpha</bar> <baz>Mercury</baz> <!-- ... more general info that applies to all foos ... --> <bar-config> <!-- the content here is specific to the bar named "Alpha" --> </bar-config> <baz-config> <!-- the content here is specific to the baz named "Mercury" --> </baz> </foo> In this case, there is some controlled vocabulary for the content of <bar>, and I can handle that part just fine. Then, based on the bar value, the appropriate xml schema should be used to validate the content of bar-config. Similarly for baz and baz-config. The code doing the parsing/validation is written in Java. Not sure how language-dependent the solution will be. Ideally, the solution would permit the xml author to declare the appropriate schema locations and what-not so that s/he could get the xml validated on the fly in a sufficiently smart editor. Also, the possible values for <bar> and <baz> are orthogonal, so I don't want to do this by extension for every possible bar/baz combo. What I mean is, if there are 24 possible bar values/schemas and 8 possible baz values/schemas, I want to be able to write 1 + 24 + 8 = 33 total schemas, instead of 1 * 24 * 8 = 192 total schemas. Also, I'd prefer to NOT break out the bar-config and baz-config into separate xml files if possible. I realize that might make all the problems much easier, as each xml file would have a single schema, but I'm trying to see if there is a good single-xml-file solution.

    Read the article

  • Problem building with MSBuild on Team Build

    - by mrwayne
    Hi, I have recently upgraded from TFS2005 to TFS2010 (and sub-sequently the team build server). I used to be able to get a team build on one of my solutions to work pretty easily, (see structure below) Solution |_Web Site | |_Bin | |_Other Files |_Project 1 |_Project 2 |_Project (n) Now, i can no longer get a build working correctly as it doesnt appear to build all my projects any longer (i've had to create a new build definition). Either that, or its not building the projects in such an order that when it hits project X, that a project it depends on (Project A), has not yet been built, and as such fails. I'm just basically trying to build a web site (not web application project), with some Dependant / linked projects. Why must it be so hard! Everything builds fine in the IDE. If i even open the solution copied to the build server under the 'Sources' directory, i am able to build it fine in the IDE on that server. No such luck with MSBuild though. Thoughts?

    Read the article

  • Hudson leaving open sessions

    - by James Carr
    Does anyone have any experiences with Hudson leaving sessions open to a Subversion server? We've been increasing our job list and got ~50 which poll the SCM regularly. It's been working fine but recently our SCM has started acting up by refusing handshakes, which we suspect is down to the sessions left open by Hudson. Last count there were ~400 sessions with nothing building on Hudson. At the moment the only solution we've found is restarting the Subversion service but this is becoming increasingly frequent and not a long term solution. Any experiences/ideas would be appreciated.

    Read the article

  • Perl: Event-driven Programming

    - by Shiftbit
    Is there any POSIX signals that I could utilize in my perl program to create event-driven programming? Currently I have multi-process program that is able to cross communicate but my parent thread is only able to listen to listen at one child at a time. foreach (@threads) { sysread(${$_}{'read'}, my $line, 100); chomp($line); print "Parent hears: $line\n"; } The problem is that the parent sits in a continual wait state until it receives it a signal from the first child before it can continue on. I am relying on 'pipe' for my intercommunication. My current solution is very similar to: http://stackoverflow.com/questions/2558098/how-can-i-use-pipe-to-facilitate-interprocess-communication-in-perl If possible I would like to rely on a $SIG{...} event or any non-CPAN solution.

    Read the article

  • Python "Every Other Element" Idiom

    - by Matt Luongo
    Hey guys, I feel like I spend a lot of time writing code in Python, but not enough time creating Pythonic code. Recently I ran into a funny little problem that I thought might have an easy, idiomatic solution. Paraphrasing the original, I needed to collect every sequential pair in a list. For example, given the list [1,2,3,4,5,6], I wanted to compute [(1,2),(3,4),(5,6)]. I came up with a quick solution at the time that looked like translated Java. Revisiting the question, the best I could do was l = [1,2,3,4,5,6] [(l[2*x],l[2*x+1]) for x in range(len(l)/2)] which has the side effect of tossing out the last number in the case that the length isn't even. Is there a more idiomatic approach that I'm missing, or is this the best I'm going to get?

    Read the article

  • Copy task in Visual Studio 2008

    - by Maurizio Reginelli
    This is a question related to a previous question I posted (see here). I need to configure a .csproj file to copy some files from a directory to another one (let me call them SOURCE and DESTINATION). I also need to change the SOURCE path depending from the configuration I'm using. For example, if I compile my project in Debug mode, the SOURCE path must contain a subfolder called Debug. I tried the solution proposed by Schmitt in the previous post, using $(ConfigurationName) to set the dynamic directory into the SOURCE path. When I opened the solution containing that project, a list of links to the source files appeared in the main tree of the project and they were correctly related to the Debug mode. But when I changed to the Release mode, I saw that the path of the linked source files were set again to the Debug version. Is there a way to specify a parameter in the Include attribute of the SourceFiles element? Thank you.

    Read the article

  • converting JSON to an object / dictionary / dynamic

    - by benpage
    I'm currently using jqGrid to display data. Part of jqGrid's interface will give you search options, posting back the search details in a JSON string, for example: {"groupOp":"AND","rules":[{"field":"PersonID","op":"eq","data":"123"},{"field":"LastName","op":"eq","data":"Smith"}]} (meaning i'm searching for personID = 123, and LastName = 'Smith') so what i'm hoping to do is somehow convert that back into something i can use server-side. Does anyone have a solution for this that may convert it back into an object of some kind? My current solution would be to convert into xml, parse with linq and create instances of my own 'search' class with a 'rules' collection.

    Read the article

  • “Could not create the file”, “Access is denied” and “Unrecoverable build error” in building setup project in VS 2008

    - by Mister Cook
    When building a setup project I get a message: Error with setup build: Error 27 Could not create the file 'C:\Users\MyName\AppData\Local\Temp\VSI1E1A.tmp' 'Access is denied.' I have tried the following (from http://support.microsoft.com/kb/329214/EN-US) regsvr32 "C:\Program Files (x86)\Common Files\Microsoft Shared\MSI Tools\mergemod.dll" The DLL registers but this does not fix my problem. Also, I tried a Clean build, deleting the temp folder, ran VS2008 as administratror, restart my PC but it occurs every time. I have no anti-virus software running and running on Windows 7 64-bit. This operation worked fine until recently. I have read many other users see this but found no solution. The only half solution I found was to edit setup properties and switch to Package files as Loose uncompressed files. This works but is not ideal as I need a full installer.

    Read the article

  • Diophantine Equation

    - by swapnika
    Write an iterative program that finds the largest number of McNuggets that cannot be bought in exact quantity. Your program should print the answer in the following format (where the correct number is provided in place of n): "Largest number of McNuggets that cannot be bought in exact quantity: n" Hints: Hypothesize possible instances of numbers of McNuggets that cannot be purchased exactly, starting with 1 For each possible instance, called n, 1. Test if there exists non-negative integers a, b, and c, such that 6a+9b+20c = n. (This can be done by looking at all feasible combinations of a, b, and c) 2. If not, n cannot be bought in exact quantity, save n When you have found six consecutive values of n that in fact pass the test of having an exact solution, the last answer that was saved (not the last value of n that had a solution) is the correct answer, since you know by the theorem that any amount larger can also be bought in exact quantity

    Read the article

  • MagicLibrary.dll 32bit convert for 64bit

    - by Nullstr1ng
    Hi Guys, I recently received an email about a guy who like to make my software compatible with 64Bit Windows 7 OS. I have 1 single problem though .. I have followed a steps regarding on compiling VS Solution/Project for 64bit. But when I start to to compile the solution for 64bit, this "Crownwood.Magic.Forms" (MagicLibrary.dll) doesn't work anymore. This is the compilation error message "Error 6 The type or namespace name 'Crownwood' could not be found (are you missing a using directive or an assembly reference?)" Please don't tell me I did not referenced the MagicLibrary.dll It's a bit of confusing because the MagicLibrary.dll is already in my reference list. I also remove and add it back. There was this CorFlags.exe but am not sure how to convert it to 64bit. "CorFlags MagicLibrary.dll /32BIT- /Force" did not work.. :( Please help. Thanks in advance - Jayson

    Read the article

  • Print "1 followed by googolplex number of zeros" [closed]

    - by Rajan
    Assuming we are not concerned about running time of the program (which is practically infinite for human mortals), we want to print out in base 10, the exact value of 10^(googolplex), one digit at a time (mostly zeros). Describe an algorithm (which can be coded on current day computers), or write a program to do this. Since we cannot practically check the output, so we will rely on collective opinion on the correctness of the program. NOTE : I do not know the solution, or whether a solution exists or not. The problem is my own invention. To those readers who think this is not a CS question... kindly reconsider. This is difficult and bit theoretical but definitely CS.

    Read the article

< Previous Page | 147 148 149 150 151 152 153 154 155 156 157 158  | Next Page >