Search Results

Search found 2397 results on 96 pages for 'copying'.


  • How John Got 15x Improvement Without Really Trying

    - by rchrd
    The following article was published on a Sun Microsystems website a number of years ago by John Feo. It is still useful and worth preserving, so I'm republishing it here.

    How I Got 15x Improvement Without Really Trying
    John Feo, Sun Microsystems

    Taking ten "personal" program codes used in scientific and engineering research, the author was able to get from 2 to 15 times performance improvement easily by applying some simple, general optimization techniques.

    Introduction

    Scientific research based on computer simulation depends on the simulation for advancement; the research can advance only as fast as the computational codes can execute. The codes' efficiency determines both the rate and quality of results. In the same amount of time, a faster program can generate more results and can carry out a more detailed simulation of physical phenomena than a slower program. Highly optimized programs help science advance quickly and ensure that monies supporting scientific research are used as effectively as possible.

    Scientific computer codes divide into three broad categories: ISV, community, and personal. ISV codes are large, mature production codes developed and sold commercially. The codes improve slowly over time, both in methods and capabilities, and they are well tuned for most vendor platforms. Since the codes are mature and complex, there are few opportunities to improve their performance solely through code optimization; improvements of 10% to 15% are typical. Examples of ISV codes are DYNA3D, Gaussian, and Nastran.

    Community codes are non-commercial production codes used by a particular research field. Generally, they are developed and distributed by a single academic or research institution with assistance from the community. Most users just run the codes, but some develop new methods and extensions that feed back into the general release. The codes are available on most vendor platforms. Since these codes are younger than ISV codes, there are more opportunities to optimize the source code; improvements of 50% are not unusual. Examples of community codes are AMBER, CHARM, BLAST, and FASTA.

    Personal codes are those written by single users or small research groups for their own use. These codes are not distributed, but may be passed from professor to student or student to student over several years. They form the primordial ocean of applications from which community and ISV codes emerge. Government research grants pay for the development of most personal codes. This paper reports on the nature and performance of this class of codes.

    Over the last year, I have looked at over two dozen personal codes from more than a dozen research institutions. The codes cover a variety of scientific fields, including astronomy, atmospheric sciences, bioinformatics, biology, chemistry, geology, and physics. The sources range from a few hundred lines to more than ten thousand lines, and are written in Fortran, Fortran 90, C, and C++. For the most part, the codes are modular, documented, and written in a clear, straightforward manner. They do not use complex language features, advanced data structures, programming tricks, or libraries. I had little trouble understanding what the codes did or how data structures were used. Most came with a makefile.

    Surprisingly, only one of the applications is parallel. All developers have access to parallel machines, so availability is not an issue. Several tried to parallelize their applications, but stopped after encountering difficulties.
    Lack of education and a perception that parallelism is difficult prevented most from trying. I parallelized several of the codes using OpenMP, and did not judge any of the codes difficult to parallelize.

    Even more surprising than the lack of parallelism is the inefficiency of the codes. I was able to get large improvements in performance in a matter of a few days by applying simple optimization techniques. Table 1 lists ten representative codes [names and affiliations are omitted to preserve anonymity]. Improvements on one processor range from 2x to 15.5x, with a simple average of 4.75x. I did not use sophisticated performance tools or drill deep into the program's execution character as one would do when tuning ISV or community codes. Using only a profiler and source line timers, I identified inefficient sections of code and improved their performance by inspection. The changes were at a high level. I am sure there is another factor of 2 or 3 in each code, and more if the codes are parallelized.

    The study's results show that personal scientific codes are running many times slower than they should, and that the problem is pervasive. Computational scientists are not sloppy programmers; however, few are trained in the art of computer programming or code optimization. I found that most have a working knowledge of some programming language and standard software engineering practices, but they do not know, or think about, how to make their programs run faster. They simply do not know the standard techniques used to make codes run faster; in fact, they do not even perceive that such techniques exist. The case studies described in this paper show that applying simple, well known techniques can significantly increase the performance of personal codes. It is important that the scientific community and the Government agencies that support scientific research find ways to better educate academic scientific programmers. The inefficiency of their codes is so bad that it is retarding both the quality and progress of scientific research.

        #    cache        redundant    loop         performance
             performance  operations   structures   improvement
        1    x            x                         15.5
        2    x                                       2.8
        3    x            x                          2.5
        4    x                                       2.1
        5    x            x                          2.0
        6                 x                          5.0
        7    x                                       5.8
        8                 x                          6.3
        9                                            2.2
        10   x                         x             3.3

        Table 1: Area of improvement and performance gains of 10 codes

    The remainder of the paper is organized as follows: sections 2, 3, and 4 discuss the three most common sources of inefficiencies in the codes studied. These are cache performance, redundant operations, and loop structures. Each section includes several examples. The last section summarizes the work and suggests a possible solution to the issues raised.

    Optimizing cache performance

    Commodity microprocessor systems use caches to increase memory bandwidth and reduce memory latencies. Typical latencies from processor to L1, L2, local, and remote memory are 3, 10, 50, and 200 cycles, respectively. Moreover, bandwidth falls off dramatically as memory distances increase. Programs that do not use cache effectively run many times slower than programs that do.

    When optimizing for cache, the biggest performance gains are achieved by accessing data in cache order and reusing data to amortize the overhead of cache misses. Secondary considerations are prefetching, associativity, and replacement; however, the understanding and analysis required to optimize for the latter are probably beyond the capabilities of the non-expert. Much can be gained simply by accessing data in the correct order and maximizing data reuse.
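    A minimal C sketch of access order (not Feo's code; assumes 8-byte doubles and 64-byte cache lines): C stores a[i][j] row by row, so the j-innermost loop walks memory at unit stride and uses every word of each line, while the i-innermost loop strides N doubles and misses on nearly every access. As the article notes, compilers will often interchange loops this simple, so compile without aggressive optimization to observe the gap.

        #include <stdio.h>
        #include <time.h>

        #define N 2048
        static double a[N][N];

        int main(void)
        {
            double sum = 0.0;
            clock_t t0 = clock();
            for (int i = 0; i < N; i++)      /* cache order: unit stride */
                for (int j = 0; j < N; j++)
                    sum += a[i][j];
            clock_t t1 = clock();
            for (int j = 0; j < N; j++)      /* wrong order: stride of N doubles */
                for (int i = 0; i < N; i++)
                    sum += a[i][j];
            clock_t t2 = clock();
            printf("row order %.3fs, column order %.3fs (sum %g)\n",
                   (double)(t1 - t0) / CLOCKS_PER_SEC,
                   (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
            return 0;
        }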
    Six out of the ten codes studied here benefited from such high level optimizations.

    Array Accesses

    The most important cache optimization is the most basic: accessing Fortran array elements in column order and C array elements in row order. Four of the ten codes (1, 2, 4, and 10) got it wrong. Compilers will restructure nested loops to optimize cache performance, but may not do so if the loop structure is too complex, or the loop body includes conditionals, complex addressing, or function calls. In code 1, the compiler failed to invert a key loop because of complex addressing:

        do I = 0, 1010, delta_x
          IM = I - delta_x
          IP = I + delta_x
          do J = 5, 995, delta_x
            JM = J - delta_x
            JP = J + delta_x
            T1 = CA1(IP, J) + CA1(I, JP)
            T2 = CA1(IM, J) + CA1(I, JM)
            S1 = T1 + T2 - 4 * CA1(I, J)
            CA(I, J) = CA1(I, J) + D * S1
          end do
        end do

    In code 2, the culprit is conditionals:

        do I = 1, N
          do J = 1, N
            If (IFLAG(I,J) .EQ. 0) then
              T1 = Value(I, J-1)
              T2 = Value(I-1, J)
              T3 = Value(I, J)
              T4 = Value(I+1, J)
              T5 = Value(I, J+1)
              Value(I,J) = 0.25 * (T1 + T2 + T5 + T4)
              Delta = ABS(T3 - Value(I,J))
              If (Delta .GT. MaxDelta) MaxDelta = Delta
            endif
          enddo
        enddo

    I fixed both programs by inverting the loops by hand.

    Code 10 has three-dimensional arrays and triply nested loops. The structure of the most computationally intensive loops is too complex to invert automatically or by hand. The only practical solution is to transpose the arrays so that the dimension accessed by the innermost loop is in cache order. The arrays can be transposed at construction or prior to entering a computationally intensive section of code. The former requires all array references to be modified, while the latter is cost effective only if the cost of the transpose is amortized over many accesses. I used the second approach to optimize code 10.

    Code 5 has four-dimensional arrays and loops nested four deep. For all of the reasons cited above, the compiler is not able to restructure three key loops. Assume C arrays and let the four dimensions of the arrays be i, j, k, and l. In the original code, the index structure of the three loops is

        L1: for i        L2: for i        L3: for i
              for l            for l            for j
                for k            for j            for k
                  for j            for k            for l

    So only L3 accesses array elements in cache order. L1 is a very complex loop, much too complex to invert. I brought the loop into cache alignment by transposing the second and fourth dimensions of the arrays. Since the code uses a macro to compute all array indexes, I effected the transpose at construction and changed the macro appropriately. The dimensions of the new arrays are now i, l, k, and j. L3 is a simple loop and easily inverted. L2 has a loop-carried scalar dependence in k. By promoting the scalar name that carries the dependence to an array, I was able to invert the third and fourth subloops, aligning the loop with cache. Code 5 is by far the most difficult of the four codes to optimize for array accesses; but the knowledge required to fix the problems is no more than that required for the other codes. I would judge this code at the limits of, but not beyond, the capabilities of appropriately trained computational scientists.

    Array Strides

    When a cache miss occurs, a line (64 bytes) rather than just one word is loaded into the cache. If data is accessed stride 1, then the cost of the miss is amortized over 8 words. Any stride other than one reduces the cost savings. Two of the ten codes studied suffered from non-unit strides. The codes represent two important classes of "strided" codes.
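    Both case studies below follow the same basic remedy: gather strided data into contiguous temporaries, work at unit stride, and (where needed) copy the results back. A minimal C sketch of the pattern, with hypothetical names (not Feo's code); it pays off when many unit-stride sweeps amortize the strided gather and scatter:

        #include <stddef.h>

        /* Gather every 'stride'-th element into a contiguous buffer, run
           many in-place (Gauss-Seidel style) relaxation sweeps over it at
           unit stride, then scatter the results back. Only the gather and
           scatter pay the strided, cache-unfriendly cost. */
        void relax_strided(double *grid, size_t n, size_t stride,
                           double *tmp, int sweeps)
        {
            for (size_t i = 0; i < n; i++)        /* strided reads */
                tmp[i] = grid[i * stride];

            for (int s = 0; s < sweeps; s++)      /* unit-stride work */
                for (size_t i = 1; i + 1 < n; i++)
                    tmp[i] = 0.5 * tmp[i] + 0.25 * (tmp[i - 1] + tmp[i + 1]);

            for (size_t i = 0; i < n; i++)        /* strided writes */
                grid[i * stride] = tmp[i];
        }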
    Code 1 employs a multi-grid algorithm to reduce time to convergence. The grids are every tenth, fifth, second, and unit element. Since time to convergence is inversely proportional to the distance between elements, coarse grids converge quickly, providing good starting values for finer grids. The better starting values further reduce the time to convergence. The downside is that grids of every nth element, n > 1, introduce non-unit strides into the computation. In the original code, much of the savings of the multi-grid algorithm was lost due to this problem. I eliminated the problem by compressing (copying) coarse grids into contiguous memory, and rewriting the computation as a function of the compressed grid. On convergence, I copied the final values of the compressed grid back to the original grid. The savings gained from unit stride access of the compressed grid more than paid for the cost of copying. Using compressed grids, the loop from code 1 included in the previous section becomes

        do j = 1, GZ
          do i = 1, GZ
            T1 = CA1(i+0, j-1) + CA1(i-1, j+0)
            T4 = CA1(i+1, j+0) + CA1(i+0, j+1)
            S1 = T1 + T4 - 4 * CA1(i+0, j+0)
            CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1
          enddo
        enddo

    where CA and CA1 are compressed arrays of size GZ.

    Code 7 traverses a list of objects, selecting objects for later processing. The labels of the selected objects are stored in an array. The selection step has unit stride, but the processing steps have irregular stride. A fix is to save the parameters of the selected objects in temporary arrays as they are selected, and pass the temporary arrays to the processing functions. The fix is practical if the same parameters are used in selection as in processing, or if processing comprises a series of distinct steps which use overlapping subsets of the parameters. Both conditions are true for code 7, so I achieved significant improvement by copying parameters to temporary arrays during selection.

    Data reuse

    In the previous sections, we optimized for spatial locality. It is also important to optimize for temporal locality. Once read, a datum should be used as much as possible before it is forced from cache. Loop fusion and loop unrolling are two techniques that increase temporal locality. Unfortunately, both techniques increase register pressure: as loop bodies become larger, the number of registers required to hold temporary values grows. Once register spilling occurs, any gains evaporate quickly. For microprocessors with small register sets or small caches, the sweet spot can be very small. In the ten codes presented here, I found no opportunities for loop fusion and only two opportunities for loop unrolling (codes 1 and 3).

    In code 1, unrolling the outer and inner loop one iteration increases the number of result values computed by the loop body from 1 to 4:

        do j = 1, GZ-2, 2
          do i = 1, GZ-2, 2
            T1 = CA1(i+0, j-1) + CA1(i-1, j+0)
            T2 = CA1(i+1, j-1) + CA1(i+0, j+0)
            T3 = CA1(i+0, j+0) + CA1(i-1, j+1)
            T4 = CA1(i+1, j+0) + CA1(i+0, j+1)
            T5 = CA1(i+2, j+0) + CA1(i+1, j+1)
            T6 = CA1(i+1, j+1) + CA1(i+0, j+2)
            T7 = CA1(i+2, j+1) + CA1(i+1, j+2)
            S1 = T1 + T4 - 4 * CA1(i+0, j+0)
            S2 = T2 + T5 - 4 * CA1(i+1, j+0)
            S3 = T3 + T6 - 4 * CA1(i+0, j+1)
            S4 = T4 + T7 - 4 * CA1(i+1, j+1)
            CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1
            CA(i+1, j+0) = CA1(i+1, j+0) + DD * S2
            CA(i+0, j+1) = CA1(i+0, j+1) + DD * S3
            CA(i+1, j+1) = CA1(i+1, j+1) + DD * S4
          enddo
        enddo

    The loop body executes 12 reads, whereas the rolled loop shown in the previous section executes 20 reads to compute the same four values.
    In code 3, two loops are unrolled 8 times and one loop is unrolled 4 times. Here is the before code:

        for (k = 0; k < NK[u]; k++) {
            sum = 0.0;
            for (y = 0; y < NY; y++) {
                sum += W[y][u][k] * delta[y];
            }
            backprop[i++] = sum;
        }

    and here is the after code for one of the loops unrolled 8 times:

        for (k = 0; k < KK - 8; k += 8) {
            sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0;
            sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0;
            for (y = 0; y < NY; y++) {
                sum0 += W[y][0][k+0] * delta[y];
                sum1 += W[y][0][k+1] * delta[y];
                sum2 += W[y][0][k+2] * delta[y];
                sum3 += W[y][0][k+3] * delta[y];
                sum4 += W[y][0][k+4] * delta[y];
                sum5 += W[y][0][k+5] * delta[y];
                sum6 += W[y][0][k+6] * delta[y];
                sum7 += W[y][0][k+7] * delta[y];
            }
            backprop[k+0] = sum0; backprop[k+1] = sum1;
            backprop[k+2] = sum2; backprop[k+3] = sum3;
            backprop[k+4] = sum4; backprop[k+5] = sum5;
            backprop[k+6] = sum6; backprop[k+7] = sum7;
        }

    Optimizing for temporal locality is the most difficult optimization considered in this paper. The concepts are not difficult, but the sweet spot is small. Identifying where the program can benefit from loop unrolling or loop fusion is not trivial. Moreover, it takes some effort to get it right. Still, educating scientific programmers about temporal locality and teaching them how to optimize for it will pay dividends.

    Reducing instruction count

    Execution time is a function of instruction count: reduce the count and you usually reduce the time. The best solution is to use a more efficient algorithm, that is, an algorithm whose order of complexity is smaller, that converges quicker, or is more accurate. Optimizing source code without changing the algorithm yields smaller, but still significant, gains. This paper considers only the latter, because the intent is to study how much better codes can run if written by programmers schooled in basic code optimization techniques. The ten codes studied benefited from three types of "instruction reducing" optimizations. The two most prevalent were hoisting invariant memory and data operations out of inner loops. The third was eliminating unnecessary data copying. The nature of these inefficiencies is language dependent.

    Memory operations

    The semantics of C make it difficult for the compiler to determine all the invariant memory operations in a loop. The problem is particularly acute for loops in functions, since the compiler may not know the values of the function's parameters at every call site when compiling the function. Most compilers support pragmas to help resolve ambiguities; however, these pragmas are not comprehensive and there is no standard syntax. To guarantee that invariant memory operations are not executed repetitively, the user has little choice but to hoist the operations by hand. The problem is not as severe in Fortran programs, because in the absence of equivalence statements it is a violation of the language's semantics for two names to share memory.

    Codes 3 and 5 are C programs. In both cases, the compiler did not hoist all invariant memory operations from inner loops. Consider the following loop from code 3:

        for (y = 0; y < NY; y++) {
            i = 0;
            for (u = 0; u < NU; u++) {
                for (k = 0; k < NK[u]; k++) {
                    dW[y][u][k] += delta[y] * I1[i++];
                }
            }
        }

    Since dW[y][u] can point to the same memory space as delta for one or more values of y and u, assignment to dW[y][u][k] may change the value of delta[y]. In reality, dW and delta do not overlap in memory, so I rewrote the loop as

        for (y = 0; y < NY; y++) {
            i = 0;
            Dy = delta[y];
            for (u = 0; u < NU; u++) {
                for (k = 0; k < NK[u]; k++) {
                    dW[y][u][k] += Dy * I1[i++];
                }
            }
        }
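    As an aside (not part of Feo's text): C99 later standardized the restrict qualifier for exactly this situation. A hedged sketch, simplified here to a flattened 2D array, that licenses the compiler to hoist the delta[y] load itself:

        /* 'restrict' promises the compiler that dW, delta, and I1 are
           reached only through these pointers, so the store to dW cannot
           change delta[y] and the load may be hoisted automatically. */
        void accumulate(double *restrict dW, const double *restrict delta,
                        const double *restrict I1, int NY, int NKtotal)
        {
            int i = 0;
            for (int y = 0; y < NY; y++)
                for (int k = 0; k < NKtotal; k++)
                    dW[y * NKtotal + k] += delta[y] * I1[i++];
        }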
    Failure to hoist invariant memory operations may also be due to complex address calculations. If the compiler cannot determine that the address calculation is invariant, then it can hoist neither the calculation nor the associated memory operations. As noted above, code 5 uses a macro to address four-dimensional arrays:

        #define MAT4D(a,q,i,j,k) (double *)((a)->data \
            + (q)*(a)->strides[0] + (i)*(a)->strides[3] \
            + (j)*(a)->strides[2] + (k)*(a)->strides[1])

    The macro is too complex for the compiler to understand, and so it does not identify any subexpressions as loop invariant. The simplest way to eliminate the address calculation from the innermost loop (over i) is to define

        a0 = MAT4D(a,q,0,j,k)

    before the loop, and then replace all instances of

        *MAT4D(a,q,i,j,k)

    in the loop with

        a0[i]

    A similar problem appears in code 6, a Fortran program. The key loop in this program is

        do n1 = 1, nh
          nx1 = (n1 - 1) / nz + 1
          nz1 = n1 - nz * (nx1 - 1)
          do n2 = 1, nh
            nx2 = (n2 - 1) / nz + 1
            nz2 = n2 - nz * (nx2 - 1)
            ndx = nx2 - nx1
            ndy = nz2 - nz1
            gxx = grn(1,ndx,ndy)
            gyy = grn(2,ndx,ndy)
            gxy = grn(3,ndx,ndy)
            balance(n1,1) = balance(n1,1) + (force(n2,1) * gxx + force(n2,2) * gxy) * h1
            balance(n1,2) = balance(n1,2) + (force(n2,1) * gxy + force(n2,2) * gyy) * h1
          end do
        end do

    The programmer has written this loop well: there are no loop invariant operations with respect to n1 and n2. However, the loop resides within an iterative loop over time, and the index calculations are independent with respect to time. Trading space for time, I precomputed the index values prior to entering the time loop and stored the values in two arrays. I then replaced the index calculations with reads of the arrays.

    Data operations

    Ways to reduce data operations can appear in many forms. Implementing a more efficient algorithm produces the biggest gains. The closest I came to an algorithm change was in code 4. This code computes the inner product of K-vectors A(i) and B(j), 0 <= i < N, 0 <= j < M, for most values of i and j. Since the program computes most of the NM possible inner products, it is more efficient to compute all the inner products in one triply-nested loop rather than one at a time when needed. The savings accrue from reading A(i) once for all B(j) vectors and from loop unrolling.

        for (i = 0; i < N; i += 8) {
            for (j = 0; j < M; j++) {
                sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0;
                sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0;
                for (k = 0; k < K; k++) {
                    sum0 += A[i+0][k] * B[j][k];
                    sum1 += A[i+1][k] * B[j][k];
                    sum2 += A[i+2][k] * B[j][k];
                    sum3 += A[i+3][k] * B[j][k];
                    sum4 += A[i+4][k] * B[j][k];
                    sum5 += A[i+5][k] * B[j][k];
                    sum6 += A[i+6][k] * B[j][k];
                    sum7 += A[i+7][k] * B[j][k];
                }
                C[i+0][j] = sum0; C[i+1][j] = sum1;
                C[i+2][j] = sum2; C[i+3][j] = sum3;
                C[i+4][j] = sum4; C[i+5][j] = sum5;
                C[i+6][j] = sum6; C[i+7][j] = sum7;
            }
        }

    This change requires knowledge of a typical run; i.e., that most inner products are computed. The reasons for the change, however, derive from basic optimization concepts. It is the type of change easily made at development time by a knowledgeable programmer. In code 5, we have the data version of the index optimization in code 6.
    Here a very expensive computation is a function of the loop indices and so cannot be hoisted out of the loop; however, the computation is invariant with respect to an outer iterative loop over time. We can compute its value for each iteration of the computation loop prior to entering the time loop and save the values in an array. The increase in memory required to store the values is small in comparison to the large savings in time.

    The main loop in code 8 is doubly nested. The inner loop includes a series of guarded computations; some are a function of the inner loop index but not the outer loop index, while others are a function of the outer loop index but not the inner loop index:

        for (j = 0; j < N; j++) {
            for (i = 0; i < M; i++) {
                r = i * hrmax;
                R = A[j];
                temp = (PRM[3] == 0.0) ? 1.0 : pow(r, PRM[3]);
                high = temp * kcoeff * B[j] * PRM[2] * PRM[4];
                low = high * PRM[6] * PRM[6] / (1.0 + pow(PRM[4] * PRM[6], 2.0));
                kap = (R > PRM[6]) ?
                    high * R * R / (1.0 + pow(PRM[4] * r, 2.0)) :
                    low * pow(R / PRM[6], PRM[5]);
                < rest of loop omitted >
            }
        }

    Note that the value of temp is invariant to j. Thus, we can hoist the computation for temp out of the loop and save its values in an array:

        for (i = 0; i < M; i++) {
            r = i * hrmax;
            TEMP[i] = pow(r, PRM[3]);
        }

    [N.B. the case for PRM[3] = 0 is omitted and will be reintroduced later.]

    We now hoist out of the inner loop the computations invariant to i. Since the conditional guarding the value of kap is invariant to i, it behooves us to hoist the computation out of the inner loop, thereby executing the guard once rather than M times. The final version of the code is

        for (j = 0; j < N; j++) {
            R = rig[j] / 1000.;
            tmp1 = kcoeff * par[2] * beta[j] * par[4];
            tmp2 = 1.0 + (par[4] * par[4] * par[6] * par[6]);
            tmp3 = 1.0 + (par[4] * par[4] * R * R);
            tmp4 = par[6] * par[6] / tmp2;
            tmp5 = R * R / tmp3;
            tmp6 = pow(R / par[6], par[5]);
            if ((par[3] == 0.0) && (R > par[6])) {
                for (i = 1; i <= imax1; i++)
                    KAP[i] = tmp1 * tmp5;
            } else if ((par[3] == 0.0) && (R <= par[6])) {
                for (i = 1; i <= imax1; i++)
                    KAP[i] = tmp1 * tmp4 * tmp6;
            } else if ((par[3] != 0.0) && (R > par[6])) {
                for (i = 1; i <= imax1; i++)
                    KAP[i] = tmp1 * TEMP[i] * tmp5;
            } else if ((par[3] != 0.0) && (R <= par[6])) {
                for (i = 1; i <= imax1; i++)
                    KAP[i] = tmp1 * TEMP[i] * tmp4 * tmp6;
            }
            for (i = 0; i < M; i++) {
                kap = KAP[i];
                r = i * hrmax;
                < rest of loop omitted >
            }
        }

    Maybe not the prettiest piece of code, but certainly much more efficient than the original loop.

    Copy operations

    Several programs unnecessarily copy data from one data structure to another. This problem occurs in both Fortran and C programs, although it manifests itself differently in the two languages.

    Code 1 declares two arrays: one for old values and one for new values. At the end of each iteration, the array of new values is copied to the array of old values to reset the data structures for the next iteration. This problem occurs in Fortran programs not included in this study, and in both Fortran 77 and Fortran 90 code. Introducing pointers to the arrays and swapping pointer values is an obvious way to eliminate the copying; but pointers are not a feature that many Fortran programmers know well or are comfortable using. An easy solution not involving pointers is to extend the dimension of the value array by 1 and use the last dimension to differentiate between arrays at different times. For example, if the data space is N x N, declare the array (N, N, 2).
    Then store the problem's initial values in (_, _, 2) and define the scalar names new = 2 and old = 1. At the start of each iteration, swap old and new to reset the arrays.

    The old-new copy problem did not appear in any C program. In programs that had new and old values, the code swapped pointers to reset data structures. Where unnecessary copying did occur was in structure assignment and parameter passing. Structures in C are handled much like scalars. Assignment causes the data space of the right-hand name to be copied to the data space of the left-hand name. Similarly, when a structure is passed to a function, the data space of the actual parameter is copied to the data space of the formal parameter. If the structure is large and the assignment or function call is in an inner loop, then copying costs can grow quite large. While none of the ten programs considered here manifested this problem, it did occur in programs not included in the study. A simple fix is always to refer to structures via pointers.

    Optimizing loop structures

    Since scientific programs spend almost all their time in loops, efficient loops are the key to good performance. Conditionals, function calls, little instruction level parallelism, and large numbers of temporary values make it difficult for the compiler to generate tightly packed, highly efficient code. Conditionals and function calls introduce jumps that disrupt code flow. Users should eliminate or isolate conditionals to their own loops as much as possible. Often logical expressions can be substituted for if-then-else statements. For example, code 2 includes the following snippet:

        MaxDelta = 0.0
        do J = 1, N
          do I = 1, M
            < code omitted >
            Delta = abs(OldValue - NewValue)
            if (Delta > MaxDelta) MaxDelta = Delta
          enddo
        enddo
        if (MaxDelta .gt. 0.001) goto 200

    Since the only use of MaxDelta is to control the jump to 200, and all that matters is whether or not it is greater than 0.001, I made MaxDelta a boolean and rewrote the snippet as

        MaxDelta = .false.
        do J = 1, N
          do I = 1, M
            < code omitted >
            Delta = abs(OldValue - NewValue)
            MaxDelta = MaxDelta .or. (Delta .gt. 0.001)
          enddo
        enddo
        if (MaxDelta) goto 200

    thereby eliminating the conditional expression from the inner loop.

    A microprocessor can execute many instructions per instruction cycle. Typically, it can execute one or more memory, floating point, integer, and jump operations. To be executed simultaneously, the operations must be independent. Thick loops tend to have more instruction level parallelism than thin loops. Moreover, they reduce memory traffic by maximizing data reuse. Loop unrolling and loop fusion are two techniques to increase the size of loop bodies. Several of the codes studied benefitted from loop unrolling, but none benefitted from loop fusion. This observation is not too surprising, since it is the general tendency of programmers to write thick loops.

    As loops become thicker, the number of temporary values grows, increasing register pressure. If registers spill, then memory traffic increases and code flow is disrupted. A thick loop with many temporary values may execute slower than an equivalent series of thin loops. The biggest gain will be achieved if the thick loop can be split into a series of independent loops, eliminating the need to write and read temporary arrays.
    I found such an occasion in code 10, where I split the loop

        do i = 1, n
          do j = 1, m
            A24(j,i) = S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i)
            B24(j,i) = S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i)
            A25(j,i) = S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i)
            B25(j,i) = S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i)
            C24(j,i) = S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i)
            D24(j,i) = S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i)
            C25(j,i) = S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i)
            D25(j,i) = S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i)
          end do
        end do

    into two disjoint loops

        do i = 1, n
          do j = 1, m
            A24(j,i) = S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i)
            B24(j,i) = S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i)
            A25(j,i) = S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i)
            B25(j,i) = S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i)
          end do
        end do

        do i = 1, n
          do j = 1, m
            C24(j,i) = S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i)
            D24(j,i) = S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i)
            C25(j,i) = S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i)
            D25(j,i) = S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i)
          end do
        end do

    Conclusions

    Over the course of the last year, I have had the opportunity to work with over two dozen academic scientific programmers at leading research universities. Their research interests span a broad range of scientific fields. Except for two programs that relied almost exclusively on library routines (matrix multiply and fast Fourier transform), I was able to significantly improve the single processor performance of all codes. Improvements range from 2x to 15.5x, with a simple average of 4.75x. Changes to the source code were at a very high level. I did not use sophisticated techniques or programming tools to discover inefficiencies or effect the changes. Only one code was parallel, despite the availability of parallel systems to all developers.

    Clearly, we have a problem: personal scientific research codes are highly inefficient and not running in parallel. The developers are unaware of simple optimization techniques that would make their programs run faster. They lack education in the art of code optimization and parallel programming. I do not believe we can fix the problem by publishing additional books or training manuals. To date, the developers in question have not studied the books or manuals available, and are unlikely to do so in the future. Short courses are a possible solution, but I believe they are too concentrated to be of much use. The general concepts can be taught in a three or four day course, but that is not enough time for students to practice what they learn and acquire the experience to apply and extend the concepts to their codes. Practice is the key to becoming proficient at optimization.

    I recommend that graduate students be required to take a semester-length course in optimization and parallel programming. We would never give someone access to state-of-the-art scientific equipment costing hundreds of thousands of dollars without first requiring them to demonstrate that they know how to use the equipment. Yet the criterion for time on state-of-the-art supercomputers is at most an interesting project. Requestors are never asked to demonstrate that they know how to use the system, or can use the system effectively. A semester course would teach them the required skills. Government agencies that fund academic scientific research pay for most of the computer systems supporting scientific research, as well as the development of most personal scientific codes.
    These agencies should require graduate schools to offer a course in optimization and parallel programming as a requirement for funding.

    About the Author

    John Feo received his Ph.D. in Computer Science from The University of Texas at Austin in 1986. After graduate school, Dr. Feo worked at Lawrence Livermore National Laboratory, where he was the Group Leader of the Computer Research Group and principal investigator of the Sisal Language Project. In 1997, Dr. Feo joined Tera Computer Company, where he was project manager for the MTA and oversaw the programming and evaluation of the MTA at the San Diego Supercomputer Center. In 2000, Dr. Feo joined Sun Microsystems as an HPC application specialist. He works with university research groups to optimize and parallelize scientific codes. Dr. Feo has published over two dozen research articles in the areas of parallel programming, parallel programming languages, and application performance.

    Read the article

  • MPI Cluster Debugger launch integration in VS2010

    Let's assume that you have all the HPC bits installed and that you have existing MPI code (or you created a "Hello World" project using the MPI project template). Of course, you create a single MPI application, and at runtime it will correspond to multiple processes (of the same app) launched on multiple nodes (i.e. machines) on the cluster. So how do you debug such a situation by simply hitting the familiar "F5" keystroke (i.e. Debug - Start Debugging)?

    WATCH IT INSTEAD OF READING ABOUT IT

    If you can't bear to read through all the details below, just watch this 19-minute screencast explaining this VS2010 feature. Alternatively, or even additionally, keep on reading.

    REQUIREMENT

    When you debug an MPI application, you would want the copying of resources from your client machine (where Visual Studio is installed) to each compute node (where Windows HPC Server is installed) to take place automatically for you. 'Resources' in the previous sentence includes your application binary, plus any binary or data dependencies it may have, plus PDBs if needed, plus the debug CRT of the correct bitness, plus msvsmon for remote debugging to work. You would also want, after copying is complete, to have your app and msvsmon launched and attached so that you can hit breakpoints back in Visual Studio on your client machine. All these things that you would want are delivered in VS2010.

    STEPS TO F5

    1. In your MPI project where you have placed a breakpoint, go to Project Properties - Configuration Properties - Debugging. Ensure the "Debugger to launch" combo box value is set to MPI Cluster Debugger.

    2. There are a whole bunch of properties here, and typically you can ignore all of them except one: Run Environment. By default it is set to run 1 process on your local machine, and if you change the number after that to, for example, 4, it will launch 4 processes of your app on your local machine. You want this to run on your cluster though, so go to the dropdown arrow at the end of the Run Environment cell and open it to expose the "Edit Hpc node" menu, which opens the Node Selector dialog. In this dialog you can enter (or pick from a list) the cluster head node name, then the number of processes you want to execute on the cluster, and then hit OK and… you are done.

    3. Press F5 and watch your breakpoint get hit (after giving it some time for copying, remote execution, attachment and symbol resolution to take place).

    GOING DEEPER

    In the MPI Cluster Debugger project properties above, you can see many additional properties to the Run Environment. They are all optional, but you may want to understand them in order to fine-tune your cluster debugging. Read all about each one of these on the MSDN page Configuration Properties for the MPI Cluster Debugger.

    In the Node Selector dialog above you can see more options than just the Head Node name and Number of Processes to run. They should be self-explanatory, but I also cover them in depth in my screencast, showing you an example of why you would choose to schedule processes per core versus per node. You can also read about these options on MSDN as part of the page How to: Configure and Launch the MPI Cluster Debugger.

    To read through an example that touches on MPI project creation, project properties, node selector, and also usage of MPI with OpenMP plus MPI with PPL, read the MSDN page Walkthrough: Launching the MPI Cluster Debugger in Visual Studio 2010.

    Happy MPI debugging! Comments about this post welcome at the original blog.
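    For reference, a minimal MPI program in C of the sort the "Hello World" template produces (a hedged sketch, not the template's actual output): each launched process reports its rank, and a breakpoint inside main is hit once per process, on whichever node that process runs.

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char *argv[])
        {
            int rank = 0, size = 0;
            MPI_Init(&argc, &argv);                 /* start the MPI runtime */
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
            MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total process count */
            printf("Hello from rank %d of %d\n", rank, size);
            MPI_Finalize();
            return 0;
        }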

    Read the article

  • How to safely copy an object?

    - by Prog
    This question is going to be a little long. Please bear with me. Something that happened in a project of mine made me think about how to safely copy objects. I'll present the situation I had and then ask a question.

    There was a class SomeClass:

        class SomeClass {
            Thing[] things;

            public SomeClass(Thing[] things) {
                this.things = things;
            }

            // irrelevant stuff omitted

            public SomeClass copy() {
                return new SomeClass(things);
            }
        }

    There was another class Processor that takes SomeClass objects, copies them (via someClassInstance.copy()), manipulates the copy's state, and returns the copy. Here it is:

        class Processor {
            public SomeClass processObject(SomeClass object) {
                SomeClass copy = object.copy();
                manipulateTheCopy(copy);
                return copy;
            }

            // irrelevant stuff omitted
        }

    I ran this, and it had bugs. I looked into these bugs, and it turned out that the manipulations Processor does on copy actually affect not only the copy, but also the original SomeClass object that was passed into processObject. I found out that it was because the original and the copy shared state: the original passed its field things into the copy when creating it.

    This made me realize that copying objects is harder than simply instantiating them with the same fields as the original. For the two objects to be completely disconnected, without any shared state, each of the fields passed to the copy also has to be copied. And if that object contains other objects, they have to be copied too. And so on. So basically, in order to be able to actually copy an object, each class in the system must have a copy() method that also invokes copy() on all of its fields, and so on. So for example, for copy() in SomeClass to work, it needs to look like this:

        public SomeClass copy() {
            Thing[] copyThings = new Thing[things.length];
            for (int i = 0; i < things.length; i++)
                copyThings[i] = things[i].copy();
            return new SomeClass(copyThings);
        }

    And if Thing has object fields of its own, then its own copy() method must be appropriate:

        class Thing {
            Apple apple;
            Pencil pencil;
            int number;

            public Thing(Apple apple, Pencil pencil, int number) {
                this.apple = apple;
                this.pencil = pencil;
                this.number = number;
            }

            public Thing copy() {
                // 'number' is a primitive.
                return new Thing(apple.getCopy(), pencil.getCopy(), number);
            }
        }

    And so on. Of course, instead of all classes having a copy() method, the copying mechanism could happen in all of the getters and constructors of classes (except in places where it isn't suitable, for example when the field points to an external object, not to an object that 'is part' of this object). Still, that means that in order to be able to safely copy an object, most classes would have to have copying mechanisms in their getters.

    My question is divided into two parts:

    How frequently do you need to get a copy of an object? Is this a regular issue?

    Is the technique described common and/or reasonable? Or is there a better way to make safe copies of objects? Or is there an easier way to safely copy objects, without them sharing any state?
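    The shared-state mechanism in miniature, sketched in C rather than Java (hypothetical names; error handling omitted): plain assignment of a struct that holds a pointer copies the pointer, not the pointed-to data, which is exactly why the original SomeClass.copy() above leaked state.

        #include <stdlib.h>
        #include <string.h>

        /* A struct holding a pointer, like SomeClass holding things[]. */
        typedef struct {
            double *things;
            size_t  count;
        } Holder;

        /* Shallow copy: duplicates the pointer, so both structs share the
           same array; mutations through one are visible through the other. */
        Holder shallow_copy(const Holder *src) {
            return *src;
        }

        /* Deep copy: allocates fresh storage and copies the contents, so
           the two objects are fully disconnected (malloc check omitted). */
        Holder deep_copy(const Holder *src) {
            Holder dst;
            dst.count  = src->count;
            dst.things = malloc(src->count * sizeof *dst.things);
            memcpy(dst.things, src->things, src->count * sizeof *dst.things);
            return dst;
        }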

    Read the article

  • How do I stop Python install on Mac OS X from putting things in my home directory?

    - by Rob
    Hi, I'm trying to install Python from source on my Mac. (OS X 10.6.2, Python-2.6.5.tar.bz2) I've done this before and it was easy, but for some reason, this time after ./configure and make, the sudo make install puts some things in my home directory instead of in /usr/local/... where I expect. The .py files are okay, but not the .so files...

        RobsMac Python-2.6.5 $ sudo make install
        [...]
        /usr/bin/install -c -m 644 ./Lib/anydbm.py /usr/local/lib/python2.6
        /usr/bin/install -c -m 644 ./Lib/ast.py /usr/local/lib/python2.6
        /usr/bin/install -c -m 644 ./Lib/asynchat.py /usr/local/lib/python2.6
        [...]
        running build_scripts
        running install_lib
        creating /Users/rob/Library/Python
        creating /Users/rob/Library/Python/2.6
        creating /Users/rob/Library/Python/2.6/site-packages
        copying build/lib.macosx-10.4-x86_64-2.6/_AE.so -> /Users/rob/Library/Python/2.6/site-packages
        copying build/lib.macosx-10.4-x86_64-2.6/_AH.so -> /Users/rob/Library/Python/2.6/site-packages
        copying build/lib.macosx-10.4-x86_64-2.6/_App.so -> /Users/rob/Library/Python/2.6/site-packages
        [...]

    Later, this causes imports that require those .so files to fail. For example...

        RobsMac Python-2.6.5 $ python
        Python 2.6.5 (r265:79063, Apr 28 2010, 13:40:18)
        [GCC 4.2.1 (Apple Inc. build 5646) (dot 1)] on darwin
        Type "help", "copyright", "credits" or "license" for more information.
        >>> import zlib
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        ImportError: No module named zlib

    Any ideas what is wrong? thanks, Rob

    Read the article

  • What strategy do you use to sync your code when working from home

    - by Ben Daniel
    At my work I currently have my development environment inside a Virtual Machine. When I need to do work from home, I copy my VM and any databases I need onto a laptop-drive-sized external USB drive. After about 10 minutes of copying, I put the drive in my pocket and head home, copy back the VM and databases onto my personal computer, and I'm ready to work. I follow the same steps to take the work back with me.

    So if I count the total amount of time I spend waiting around for files to finish copying in order for me to take work home and bring it back again, it comes to around 40 minutes! I do have a VPN connection to my work from home (providing the internet is up at both sites) and a decent internet speed (8mbits down/?up), but I find Remote Desktoping into my work machine laggy enough for me to want to work on my VM directly.

    So in looking at what other options I have or how I could improve my existing option, I'm interested in what strategy you use or recommend to do work at home while keeping your code/environment in sync.

    EDIT: I'd prefer an option where I don't have to commit my changes into version control before I leave work, as I like to make meaningful descriptive comments in my commits; committing would take longer than just copying my VM onto a portable drive! lol Also I'd prefer a solution where my dev environment stays in sync too. Having said that, I'm still very interested in your own solutions, even if they don't exactly solve my problem as best as I'd like. :)

    Read the article

  • Publish Maven artifacts on FTP with Hudson FTP Publisher Plugin

    - by jaguard
    I'm building a number of artefacts (zip files for different environments: test, dev) using the maven-assembly-plugin with a specialized Maven profile. I want to copy/collect these artefacts on an FTP server, keeping the version (01.07.10.16.Wed-1626) as a folder, so I need to copy from test/build/01.07.10.16.Wed-1626/ to ftp://my-ftp-server:21/projects/myserver-1.7/01.07.10.16.Wed-1626/

    The layout for the Maven output is this:

        target/
          build/
            01.07.10.16.Wed-1626/
              my-server-01.07.10.16.Wed-1626-dev.zip
              my-server-01.07.10.16.Wed-1626-test.zip

    For copying the artefacts I'm using the FTP Publisher Plugin, but it seems I'm missing something: even though the build is OK and the artefacts are built without problem, the job finishes without copying the artefacts, and in the console there is no log info about copying them.

    My FTP publisher config (FTP repository hosts) is:

        Hostname: my-ftp-server
        Port: 21
        Timeout: 10000
        Root Repository Path: projects
        User Name: my-user
        Password: my-pass

    My Hudson job FTP publisher config (Publish artifacts to FTP) is:

        FTP site: my-ftp-server
        Files to upload
          Source: target/build/**
          Destination: myserver-1.7

    1: Is there any log to check if there are any FTP copy errors?
    2: Is there any problem with the file pattern (source) or with the dest?

    Read the article

  • How to return ArrayList results from an IntentService

    - by gcl1
    I have an IntentService that loads up an ArrayList with data from a network source (AWS SDB tables). The ArrayList is in a global space, accessible to both the calling Activity and the IntentService (like this: appState = ((App)getApplicationContext())). When the IntentService is done, it notifies the Activity through a ResultReceiver, and the Activity calls adapter.notifyDataChanged() to update the ListView.

    This solution works most of the time, ... but it violates the rule that only the UI thread should make changes to data underlying a ListView. So as it is, I sometimes get an error: "The content of the adapter has changed but ListView did not receive a notification."

    I think this must be a common situation. Please let me know if you have any suggestions or best practices for this problem. Here are three options I'm aware of:

    1. Keep the IntentService, and have it store the results in another "working" ArrayList, also in the global space. When the result is ready, the IntentService calls the ResultReceiver (on the UI thread), which can then: a) copy the result to the ArrayList associated with the ListView, and b) call adapter.notifyDataChanged(). CONS: I don't like the idea of putting temp/working data in a global space, and copying the result list seems inefficient.

    2. Keep the IntentService, and have it pass the results back through a bundle loaded with a ParcelableArrayList. CONS: I'm not sure if this approach would scale for very large result sets. It also requires copying the result list.

    3. Switch to a Service which builds a local copy of the result list. Have the Activity directly access the address space of the Service in order to read the result list. CON: Still requires copying results to the ArrayList associated with the ListView.

    Thank you.

    Read the article

  • How to copy files without slowing down my app?

    - by Kevin Gebhardt
    I have a bunch of little files in my assets which need to be copied to the SD card on the first start of my app. The copy code I got from here, placed in an IntentService, works like a charm. However, when I start to copy many little files, the whole app gets incredibly slow (I'm not really sure why, by the way), which is a really bad experience for the user on first start. As I realised other apps were running normally during that time, I tried to start a child process for the service, which didn't work, as I can't access my assets from another process as far as I understood. Has anybody out there an idea how a) to copy the files without blocking my app, b) to get through to my assets from a private process (process=":myOtherProcess" in Manifest), or c) to solve the problem in a completely different way?

    Edit: To make this clearer: The copying already takes place in a separate thread (started automatically by IntentService). The problem is not to separate the task of copying, but that the copying in a dedicated thread somehow affects the rest of the app (e.g. blocking too many app-specific resources?) but not other apps (so it's not blocking the whole CPU or something).

    Edit 2: Problem solved; it turns out there wasn't really a problem. See my answer below.

    Read the article

  • Nginx ssl - SSL: error:0906D06C:PEM routines:PEM_read_bio:no start line

    - by Alex
    I am trying to enable ssl on a server using a certificate from 123-reg, but I keep getting this error:

        nginx: [emerg] SSL_CTX_use_certificate_chain_file("/opt/nginx/conf/cleantechlms.crt") failed
        (SSL: error:0906D06C:PEM routines:PEM_read_bio:no start line
        error:140DC009:SSL routines:SSL_CTX_use_certificate_chain_file:PEM lib)

    This is my nginx config:

        server {
            listen 443;
            server_name a-fake-url.com;
            root /file/path/public;
            passenger_enabled on;
            ssl on;
            ssl_certificate /opt/nginx/conf/cleantechlms.crt;
            ssl_certificate_key /opt/nginx/conf/cleantechlms.key;
        }

    I have tried setting my crt and key to full file permissions, but there is no difference. My crt file is the crt I was issued concatenated with the CA crt.

    Update: I have tried putting the two certificates in separate files and then running 'cat mykey.crt ca.cert'. I also tried manually copying the keys into the same file. Any ideas?

    Read the article

  • How to abort robocopy on first error

    - by Yurik
    When using the robocopy Windows utility, what flags do I set so that robocopy aborts on the very first error it sees, similar to the xcopy /dry command? I need to mirror two dirs, and on occasion some files will be locked. I do not want robocopy to keep trying to copy files, or overwrite the files that are not locked; rather, the very first error should stop the whole copy process.

    UPDATE: I already have /R set to 0; unfortunately it only applies to a single file, NOT to the whole copying process. Hence, the first file is ignored (instead of stopping the copying), but subsequent files are copied.

    Read the article

  • A problem on Windows 7. Is it a bug?

    - by yihang
    I am running Windows 7 Ultimate. I tried to copy something onto my pendrive, and the copying window appeared. Now, try to click on the small icon on the left of the title bar: a context menu will appear. Click on the icon again: the context menu disappears, but the copying process is also canceled. So, do you have this problem? Is it a bug?

    Read the article

  • Files Corrupted on System Restore

    - by Yar
    I restored my OS X system today by copying it over from a backup. Most things seem to be working, but every single git repo gives pretty much the same error:

        fatal: object 03b45161eb27228914e690e032ca8009358e9588 is corrupted

    I have tried chowning and doing everything as sudo or root... I have no idea what to try next. This would be a normal git question except that it's on many repos. Ideas?

    Note: I'm using git 1.7.0.3, and I was probably using 1.7.0 before.

    Edit: Tried with 1.7.0.2 and it made no difference.

    Edit: Even when copying any of the repos I get this strange message:

        cp: .git/objects/fe/86b676974a44aa7f128a55bf27670f4a1073ca: could not copy extended attributes to /eraseme/Pickers/.git/objects/fe/86b676974a44aa7f128a55bf27670f4a1073ca: Operation not permitted

    Read the article

  • What is the data transfer speeds within the disk, to other devices?

    - by Kumar
    I use Debian 6 on an HP EliteBook 6930 with 2 GB of RAM. I was copying two AVI files, 1.5 GB in total, and noticed that the data copying was done at the rate of 4 MB/sec. When I copy the same AVIs to my Western Digital Passport 25G USB plug-in drive, the data transfer speed is 11+ MB/sec. This speed is different if I plug the drive into different USB ports. Interestingly, at work I also downloaded a 16 MB IE8 exe on my virtual XP, run inside Oracle Sun VirtualBox, and it was downloaded AND saved within a second. We have a high speed network at work. :-) Why and how is this difference in data speeds possible?

    Read the article

  • Copy and Paste without needing to use the Mouse

    - by Paul Farry
    In previous versions of Windows, when you were copying files (Ctrl-C), then Alt-Tabbing (to the appropriate window) and pasting (Ctrl-V), everything could be driven by the keyboard. With Vista and 7, it seems that if you are copying and pasting and the files already exist in the destination location, you have to click "Copy and Replace" with the mouse; there doesn't seem to be a way to do it from the keyboard. "Do this for the next {n} conflicts" can be driven by Alt-D, but "Copy and Replace" and "Don't Copy" are not. Or is there, and have I just missed something? I'm more than happy to change the registry or something to enable keyboard shortcuts for this.

    Read the article

  • F5 BigIP upgrade from 9.x to 10.x

    - by mbuk2k
    Having a few difficulties upgrading a Big IP 3400 from 9.4.8 to any version 10.x image. The following are the versions I've tried:

        10.1.0.3341.0
        10.2.2.763.3
        10.2.3.112.0
        10.2.4.577.0

    To upgrade I'm running the following command:

        image2disk --format=volumes BIGIP-10.1.0.3341.0.iso

    Obviously replacing the version number with the relevant image I'm trying to upgrade to each time. The F5 reboots and starts copying packages; however, after 30 seconds or so it just stops copying. The cursor in the console is still flashing, but no matter how long it's left, the package doesn't copy. It seems to be a different package with each version/image (but always the same package per version) at the point of freezing, which I'm guessing suggests a space issue? I've checked free space on the device and it has over 2 GB free at root, which should surely be enough? If anyone has any advice or pointers, it would be kindly appreciated. Thank you

    Read the article

  • Server freeze (Disk I/O possibly)

    - by user973917
    I have a Windows Server 2008 machine that is resyncing disks after a power loss. The issue is that the system becomes unresponsive after about 10 minutes. We've checked with Resource Monitor and found that the CPUs aren't maxed, but the disk I/O is well over 250 MB/s. We've attempted copying data from one disk to another, bypassing the syncing of disks, and this too causes the machine to freeze after about 10 minutes of copying data. I have attempted to let the machine resync the disks for a few days with the machine on in this "frozen" state. By frozen I mean that NOTHING works on the machine; it's completely unresponsive: no mouse movement, etc. I want to know how I would go about definitively checking whether it is disk I/O that is freezing the system. I know that disk I/O can freeze a system, but what can I use to run tests to be sure?

    Read the article

  • Unable to restore from Shadow Copy due to long filename

    - by Spongeboy
    We have shadow copy enabled on our Windows SBS 2008 server. Attempting to restore a file from a shadow copy gave the following error:

        The source file name(s) are larger than is supported by the file system. Try moving to a location which has a shorter path name, or try renaming to shorter name(s) before attempting this operation.

    The filename has 67 characters, and its shadow copy path is 170 characters. These seem to be under the NTFS limits (260?). We tried:

    - Copying to the shortest path possible (C:)
    - Copying to the shortest path possible on both a client computer and the server itself

    Is it possible to rename files in a shadow copy before doing the copy? Any idea why the error is appearing despite the filename size appearing to be within limits?

    Read the article

  • MacOS X 10.6 Portable Home Directory sync fails due to FileSync agent crashing

    - by tegbains
    On one of our cleanly installed MacPro machines running MacOS X 10.6.6, connected to our MacOS X 10.6.6 Server, syncing data using Portable Home Directories fails. It seems to be due to the FileSync agent crashing during the home sync. We get -41 and -8062 errors, which we suspect indicate that there is too much data or that the FileSync agent can't read the files. The user is the owner of the files and can read/write all of the files.

        < Logout 0:: [11/02/04 13:10:42.751] Error -41 copying /Volumes/RCAUsers/earlpeng/Library/Mail/Mailboxes/email from old imac./Attachments/12081/2.2. (source = NO)
        < Logout 0:: [11/02/04 13:10:42.758] Error -8062 copying /Volumes/RCAUsers/earlpeng/Library/Mail/Mailboxes/email from old imac./Attachments/12081/2.2/[email protected]. (source = NO)
        < Logout 1:: [11/02/04 13:10:42.758] -[DeepCopyContext deepCopyError:sourceError:sourceRef:]: error = -8062, wasSource = NO: return shouldContinue = NO

    Read the article

  • Using Gentoo's `ebegin`, `eend` etc under Ubuntu

    - by Marcus Downing
    We're quite fond of the style of the ebegin, eend, eerror, eindent etc. commands used by Portage and other tools on Gentoo. The green-yellow-red bullets and standard layout make for very quick spotting of errors, on what would otherwise be very grey command line output.

        #!/bin/sh
        source /etc/init.d/functions.sh

        ebegin "Copying data"
        rsync ....
        eend $?

    Producing output similar to:

        * Copying data...                                    [ OK ]

    As a result we're using these commands in some of our common shell scripts, which is a problem for the people using Ubuntu and other linuxes. (linuces? linuxen? linucae? other distros) On Gentoo these functions are provided by OpenRC, and imported with the functions.sh file (whose exact position seems to vary slightly). But is there a simple way of getting these commands on Ubuntu? In theory we could replace them all with dull echos, but we'd rather not.

    Read the article

  • Sharepoint 2007 - Transaction log full

    - by Kenny Bones
    So I have this SharePoint 2007 site that is basically trash. I'm supposed to just toss it, but I need to copy out all of the data, in the form of traditional files and folders, from certain projects. And since the transaction log is full, it's so damn slow. Even opening SharePoint takes up to 15 minutes, or it won't open at all. Copying of files is extremely slow. So I'm in need of a quick fix here, just to be able to copy out some files and folders. I don't need to fix the problem per se. What can I do to fix it temporarily so I can copy out the data?

    Read the article

  • Move/migrate vbox virtual machine to another computer along with snapshots

    - by user53864
    I have Sun VirtualBox (2.2) installed on Windows XP, and I have some VMs (Ubuntu Server) running with many snapshots. How do I move the VirtualBox VDIs along with all the snapshots to a new Windows system? I tried copying the base VDI and creating it anew on a new system, and it worked fine, but I could not move the snapshots of the VMs. I even tried copying VirtualBox.xml and the machine's xml file to the new system and manually editing the snapshots location in the VirtualBox.xml file to point to the new place. The snapshots are listed in the snapshots tab (but when I checked the snapshots location in the VM's settings, it showed the default location even though I had edited it), and when I try to use the snapshots, reverting to current, the VM doesn't boot (some snapshots give a GRUB 2 error, some stall at boot). I also tried export and import, but I could only export the current state and use it on the new system, with no snapshots. Is there any way I could achieve this? Please help...

    Read the article

  • slow disk writes between host and guest

    - by Jure1873
    I've got Ubuntu (server kernel) on an AMD X4 with 4 GB of RAM and 2x Seagate SATA 1 TB disks for testing virtual machines, and the write performance is very slow. The two disks are in a software RAID1 array, with one small boot ext3 partition, a 10 GB system partition, and the rest an XFS partition (about 980 GB) for data (virtual machines). If I'm copying files from the virtual machine to the host with rsync or scp, the copy frequently stalls or goes at about 1 MB/s. What's wrong? I've tried disabling barriers on XFS and increasing logbufs and allocsize, but it seems nothing helps. The strange thing is that await (for example during copying) for sda is usually under 100, while for sdb it is around 400. Any ideas on what could be wrong / what I could do to improve this setup?

    Read the article
