Search Results

Search found 27207 results on 1089 pages for 'preferred solution'.

Page 175/1089 | < Previous Page | 171 172 173 174 175 176 177 178 179 180 181 182  | Next Page >

  • Avoid implicit conversion from date to timestamp for selects with Oracle 11g using Hibernate

    - by sapporo
    I'm using Hibernate 3.2.1 criteria queries to select rows from an Oracle 11g database, filtering by a timestamp field. The field in question is of type java.util.Date in Java, and DATE in Oracle. It turns out that the field gets mapped to java.sql.Timestamp, and Oracle converts all rows to TIMESTAMP before comparing to the passed in value, bypassing the index and thereby ruining performance. One solution would be to use Hibernate's sqlRestriction() along with Oracle's TO_DATE function. That would fix performance, but requires rewriting the application code (lots of queries). So is there a more elegant solution? Since Hibernate already does type mapping, could it be configured to do the right thing?

    Read the article

  • Eclipse CDT on Snow Leopard cannot find binaries

    - by ejel
    After upgraded to Snow Leopard, I can no longer run Eclipse CDT project on my computer. While the build process completes without any error, Eclipse does not recognize the binary file it created. When try to point to the binary file in Run Configuration.. dialog, it cannot find any binary in the project. Though executing the file from Terminal works fine. According to a post at on Eclipse forum, this might be a problem that Mach-O parser does not recognize 64-bit binaries. Does anyone know what are the solutions or workarounds to the problem so that I can run/debug my C++ projects on Snow Leopard. UPDATED The solution suggested by Shane, though allowing the binary created to be recognized, does introduce another problem. Since system libraries in Snow Leopard are all 64 bits, it is no longer possible to link the code created with -arch i386 with these libraries, and hence not a feasible solution yet.

    Read the article

  • CRM 2011 XAML Workflows in VS 2012

    - by AlexR
    I'm writing this knowing there's another topic like this to be found here the answer provided however does not make much sense to me and whatever I tried doesn't seem to work so I'm seeking clarification. The situation is as follows: * New CRM 2011 toolkit solution * New project added for Xaml workflow * Adding CRM "Workflow" activity causes error The error: It shows a redbox "Could not generate view for Workflow" The Exception: System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. --- System.IO.IOException: Cannot locate resource 'workflowdesigner.xaml' I have VS 2012 SP2 with the latest (June) CRM SDK Toolkit installed. I did a clean install of the toolkit so it's not "upgraded" to prevent any old assemblies being referenced. I tried the things from the referenced question and moved the assemblies over and also re-referenced the latest assemblies to make sure everything works fine. I also tried downloading a workflow XAML created in CRM and opening but it gives the same error. I also opened the "example" workflow XAML solution in the SDK, again same result.

    Read the article

  • Why isn't Expression Blend rendering my User Control? It's only showing XAML.

    - by unforgiven3
    I'm opening valid XAML within my VS2008 solution in Expression Blend 3 and it is only showing XAML when I try to open individual XAML files. My solution/projects all build and run correctly. When I go to View - Active Document View the Design View, Split View and XAML View options are all grayed out... which doesn't make much sense. I'm not much of a Blend user, but this has never happened before, and I'm coming up blank for how to fix it. Any ideas?

    Read the article

  • How do I start a second console application in Visual Studio when one is already running

    - by Kettenbach
    Hi All, I am working through some examples in a WCF book. There is a Host project and Client project within a single solution. Both are console applications. The Host is the startup app, but the Client app doesn't seem to open the Console like the book says. Book says while the Host is running, run the Client. The Run button is disabled tho as it is already running. The book example definitely has them in the same solution and a single instance of Visual Studio. Anyways, what am I missing here? I have done this with two instances of VS, but I truly have never does this in a single instance. Any help is always appreciated. Cheers, ~ck in San Diego

    Read the article

  • Delayed "rendering" of WPF/Silverlight Dependency Properties?

    - by Aardvark
    Is there a way to know the first time a Dependency Property is accessed through XAML binding so I can actually "render" the value of the property when needed? I have an object (class derived from Control) that has several PointCollection Dependency Properties that may contain 100's or 1000's of points. Each property may arrange the points differently for use in different types shapes (Polyline, Polygon, etc - its more complicated then this, but you get the idea). Via a Template different XAML objects use TemplateBinding to access these properties. Since my object uses a Template I never know what XAML shapes may be in use for my object - so I never know what Properties they may or may not bind to. I'd like to only fill-in these PointCollections when they are actually needed. Normally in .NET I'd but some logic in the Property's getter, but these are bypassed by XAML data binding. I need a WPF AND Silverlight compatible solution. I'd love a solution that avoids any additional complexities for the users of my object.

    Read the article

  • Keeping the DI-container usage in the composition root in Silverlight and MVVM

    - by adrian hara
    It's not quite clear to me how I can design so I keep the reference to the DI-container in the composition root for a Silverlight + MVVM application. I have the following simple usage scenario: there's a main view (perhaps a list of items) and an action to open an edit view for one single item. So the main view has to create and show the edit view when the user takes the action (e.g. clicks some button). For this I have the following code: public interface IView { IViewModel ViewModel {get; set;} } Then, for each view that I need to be able to create I have an abstract factory, like so public interface ISomeViewFactory { IView CreateView(); } This factory is then declared a dependency of the "parent" view model, like so: public class SomeParentViewModel { public SomeParentViewModel(ISomeViewFactory viewFactory) { // store it } private void OnSomeUserAction() { IView view = viewFactory.CreateView(); dialogService.ShowDialog(view); } } So all is well until here, no DI-container in sight :). Now comes the implementation of ISomeViewFactory: public class SomeViewFactory : ISomeViewFactory { public IView CreateView() { IView view = new SomeView(); view.ViewModel = ???? } } The "????" part is my problem, because the view model for the view needs to be resolved from the DI-container so it gets its dependencies injected. What I don't know is how I can do this without having a dependency to the DI-container anywhere except the composition root. One possible solution would be to have either a dependency on the view model that gets injected into the factory, like so: public class SomeViewFactory : ISomeViewFactory { public SomeViewFactory(ISomeViewModel viewModel) { // store it } public IView CreateView() { IView view = new SomeView(); view.ViewModel = viewModel; } } While this works, it has the problem that since the whole object graph is wired up "statically" (i.e. the "parent" view model will get an instance of SomeViewFactory, which will get an instance of SomeViewModel, and these will live as long as the "parent" view model lives), the injected view model implementation is stateful and if the user opens the child view twice, the second time the view model will be the same instance and have the state from before. I guess I could work around this with an "Initialize" method or something similar, but it doesn't smell quite right. Another solution might be to wrap the DI-container and have the factories depend on the wrapper, but it'd still be a DI-container "in disguise" there :) Any thoughts on this are greatly appreciated. Also, please forgive any mistakes or rule-breaking, since this is my first post on stackoverflow :) Thanks! ps: my current solution is that the factories know about the DI-container, and it's only them and the composition root that have this dependency.

    Read the article

  • How John Got 15x Improvement Without Really Trying

    - by rchrd
    The following article was published on a Sun Microsystems website a number of years ago by John Feo. It is still useful and worth preserving. So I'm republishing it here.  How I Got 15x Improvement Without Really Trying John Feo, Sun Microsystems Taking ten "personal" program codes used in scientific and engineering research, the author was able to get from 2 to 15 times performance improvement easily by applying some simple general optimization techniques. Introduction Scientific research based on computer simulation depends on the simulation for advancement. The research can advance only as fast as the computational codes can execute. The codes' efficiency determines both the rate and quality of results. In the same amount of time, a faster program can generate more results and can carry out a more detailed simulation of physical phenomena than a slower program. Highly optimized programs help science advance quickly and insure that monies supporting scientific research are used as effectively as possible. Scientific computer codes divide into three broad categories: ISV, community, and personal. ISV codes are large, mature production codes developed and sold commercially. The codes improve slowly over time both in methods and capabilities, and they are well tuned for most vendor platforms. Since the codes are mature and complex, there are few opportunities to improve their performance solely through code optimization. Improvements of 10% to 15% are typical. Examples of ISV codes are DYNA3D, Gaussian, and Nastran. Community codes are non-commercial production codes used by a particular research field. Generally, they are developed and distributed by a single academic or research institution with assistance from the community. Most users just run the codes, but some develop new methods and extensions that feed back into the general release. The codes are available on most vendor platforms. Since these codes are younger than ISV codes, there are more opportunities to optimize the source code. Improvements of 50% are not unusual. Examples of community codes are AMBER, CHARM, BLAST, and FASTA. Personal codes are those written by single users or small research groups for their own use. These codes are not distributed, but may be passed from professor-to-student or student-to-student over several years. They form the primordial ocean of applications from which community and ISV codes emerge. Government research grants pay for the development of most personal codes. This paper reports on the nature and performance of this class of codes. Over the last year, I have looked at over two dozen personal codes from more than a dozen research institutions. The codes cover a variety of scientific fields, including astronomy, atmospheric sciences, bioinformatics, biology, chemistry, geology, and physics. The sources range from a few hundred lines to more than ten thousand lines, and are written in Fortran, Fortran 90, C, and C++. For the most part, the codes are modular, documented, and written in a clear, straightforward manner. They do not use complex language features, advanced data structures, programming tricks, or libraries. I had little trouble understanding what the codes did or how data structures were used. Most came with a makefile. Surprisingly, only one of the applications is parallel. All developers have access to parallel machines, so availability is not an issue. Several tried to parallelize their applications, but stopped after encountering difficulties. Lack of education and a perception that parallelism is difficult prevented most from trying. I parallelized several of the codes using OpenMP, and did not judge any of the codes as difficult to parallelize. Even more surprising than the lack of parallelism is the inefficiency of the codes. I was able to get large improvements in performance in a matter of a few days applying simple optimization techniques. Table 1 lists ten representative codes [names and affiliation are omitted to preserve anonymity]. Improvements on one processor range from 2x to 15.5x with a simple average of 4.75x. I did not use sophisticated performance tools or drill deep into the program's execution character as one would do when tuning ISV or community codes. Using only a profiler and source line timers, I identified inefficient sections of code and improved their performance by inspection. The changes were at a high level. I am sure there is another factor of 2 or 3 in each code, and more if the codes are parallelized. The study’s results show that personal scientific codes are running many times slower than they should and that the problem is pervasive. Computational scientists are not sloppy programmers; however, few are trained in the art of computer programming or code optimization. I found that most have a working knowledge of some programming language and standard software engineering practices; but they do not know, or think about, how to make their programs run faster. They simply do not know the standard techniques used to make codes run faster. In fact, they do not even perceive that such techniques exist. The case studies described in this paper show that applying simple, well known techniques can significantly increase the performance of personal codes. It is important that the scientific community and the Government agencies that support scientific research find ways to better educate academic scientific programmers. The inefficiency of their codes is so bad that it is retarding both the quality and progress of scientific research. # cacheperformance redundantoperations loopstructures performanceimprovement 1 x x 15.5 2 x 2.8 3 x x 2.5 4 x 2.1 5 x x 2.0 6 x 5.0 7 x 5.8 8 x 6.3 9 2.2 10 x x 3.3 Table 1 — Area of improvement and performance gains of 10 codes The remainder of the paper is organized as follows: sections 2, 3, and 4 discuss the three most common sources of inefficiencies in the codes studied. These are cache performance, redundant operations, and loop structures. Each section includes several examples. The last section summaries the work and suggests a possible solution to the issues raised. Optimizing cache performance Commodity microprocessor systems use caches to increase memory bandwidth and reduce memory latencies. Typical latencies from processor to L1, L2, local, and remote memory are 3, 10, 50, and 200 cycles, respectively. Moreover, bandwidth falls off dramatically as memory distances increase. Programs that do not use cache effectively run many times slower than programs that do. When optimizing for cache, the biggest performance gains are achieved by accessing data in cache order and reusing data to amortize the overhead of cache misses. Secondary considerations are prefetching, associativity, and replacement; however, the understanding and analysis required to optimize for the latter are probably beyond the capabilities of the non-expert. Much can be gained simply by accessing data in the correct order and maximizing data reuse. 6 out of the 10 codes studied here benefited from such high level optimizations. Array Accesses The most important cache optimization is the most basic: accessing Fortran array elements in column order and C array elements in row order. Four of the ten codes—1, 2, 4, and 10—got it wrong. Compilers will restructure nested loops to optimize cache performance, but may not do so if the loop structure is too complex, or the loop body includes conditionals, complex addressing, or function calls. In code 1, the compiler failed to invert a key loop because of complex addressing do I = 0, 1010, delta_x IM = I - delta_x IP = I + delta_x do J = 5, 995, delta_x JM = J - delta_x JP = J + delta_x T1 = CA1(IP, J) + CA1(I, JP) T2 = CA1(IM, J) + CA1(I, JM) S1 = T1 + T2 - 4 * CA1(I, J) CA(I, J) = CA1(I, J) + D * S1 end do end do In code 2, the culprit is conditionals do I = 1, N do J = 1, N If (IFLAG(I,J) .EQ. 0) then T1 = Value(I, J-1) T2 = Value(I-1, J) T3 = Value(I, J) T4 = Value(I+1, J) T5 = Value(I, J+1) Value(I,J) = 0.25 * (T1 + T2 + T5 + T4) Delta = ABS(T3 - Value(I,J)) If (Delta .GT. MaxDelta) MaxDelta = Delta endif enddo enddo I fixed both programs by inverting the loops by hand. Code 10 has three-dimensional arrays and triply nested loops. The structure of the most computationally intensive loops is too complex to invert automatically or by hand. The only practical solution is to transpose the arrays so that the dimension accessed by the innermost loop is in cache order. The arrays can be transposed at construction or prior to entering a computationally intensive section of code. The former requires all array references to be modified, while the latter is cost effective only if the cost of the transpose is amortized over many accesses. I used the second approach to optimize code 10. Code 5 has four-dimensional arrays and loops are nested four deep. For all of the reasons cited above the compiler is not able to restructure three key loops. Assume C arrays and let the four dimensions of the arrays be i, j, k, and l. In the original code, the index structure of the three loops is L1: for i L2: for i L3: for i for l for l for j for k for j for k for j for k for l So only L3 accesses array elements in cache order. L1 is a very complex loop—much too complex to invert. I brought the loop into cache alignment by transposing the second and fourth dimensions of the arrays. Since the code uses a macro to compute all array indexes, I effected the transpose at construction and changed the macro appropriately. The dimensions of the new arrays are now: i, l, k, and j. L3 is a simple loop and easily inverted. L2 has a loop-carried scalar dependence in k. By promoting the scalar name that carries the dependence to an array, I was able to invert the third and fourth subloops aligning the loop with cache. Code 5 is by far the most difficult of the four codes to optimize for array accesses; but the knowledge required to fix the problems is no more than that required for the other codes. I would judge this code at the limits of, but not beyond, the capabilities of appropriately trained computational scientists. Array Strides When a cache miss occurs, a line (64 bytes) rather than just one word is loaded into the cache. If data is accessed stride 1, than the cost of the miss is amortized over 8 words. Any stride other than one reduces the cost savings. Two of the ten codes studied suffered from non-unit strides. The codes represent two important classes of "strided" codes. Code 1 employs a multi-grid algorithm to reduce time to convergence. The grids are every tenth, fifth, second, and unit element. Since time to convergence is inversely proportional to the distance between elements, coarse grids converge quickly providing good starting values for finer grids. The better starting values further reduce the time to convergence. The downside is that grids of every nth element, n > 1, introduce non-unit strides into the computation. In the original code, much of the savings of the multi-grid algorithm were lost due to this problem. I eliminated the problem by compressing (copying) coarse grids into continuous memory, and rewriting the computation as a function of the compressed grid. On convergence, I copied the final values of the compressed grid back to the original grid. The savings gained from unit stride access of the compressed grid more than paid for the cost of copying. Using compressed grids, the loop from code 1 included in the previous section becomes do j = 1, GZ do i = 1, GZ T1 = CA(i+0, j-1) + CA(i-1, j+0) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) S1 = T1 + T4 - 4 * CA1(i+0, j+0) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 enddo enddo where CA and CA1 are compressed arrays of size GZ. Code 7 traverses a list of objects selecting objects for later processing. The labels of the selected objects are stored in an array. The selection step has unit stride, but the processing steps have irregular stride. A fix is to save the parameters of the selected objects in temporary arrays as they are selected, and pass the temporary arrays to the processing functions. The fix is practical if the same parameters are used in selection as in processing, or if processing comprises a series of distinct steps which use overlapping subsets of the parameters. Both conditions are true for code 7, so I achieved significant improvement by copying parameters to temporary arrays during selection. Data reuse In the previous sections, we optimized for spatial locality. It is also important to optimize for temporal locality. Once read, a datum should be used as much as possible before it is forced from cache. Loop fusion and loop unrolling are two techniques that increase temporal locality. Unfortunately, both techniques increase register pressure—as loop bodies become larger, the number of registers required to hold temporary values grows. Once register spilling occurs, any gains evaporate quickly. For multiprocessors with small register sets or small caches, the sweet spot can be very small. In the ten codes presented here, I found no opportunities for loop fusion and only two opportunities for loop unrolling (codes 1 and 3). In code 1, unrolling the outer and inner loop one iteration increases the number of result values computed by the loop body from 1 to 4, do J = 1, GZ-2, 2 do I = 1, GZ-2, 2 T1 = CA1(i+0, j-1) + CA1(i-1, j+0) T2 = CA1(i+1, j-1) + CA1(i+0, j+0) T3 = CA1(i+0, j+0) + CA1(i-1, j+1) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) T5 = CA1(i+2, j+0) + CA1(i+1, j+1) T6 = CA1(i+1, j+1) + CA1(i+0, j+2) T7 = CA1(i+2, j+1) + CA1(i+1, j+2) S1 = T1 + T4 - 4 * CA1(i+0, j+0) S2 = T2 + T5 - 4 * CA1(i+1, j+0) S3 = T3 + T6 - 4 * CA1(i+0, j+1) S4 = T4 + T7 - 4 * CA1(i+1, j+1) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 CA(i+1, j+0) = CA1(i+1, j+0) + DD * S2 CA(i+0, j+1) = CA1(i+0, j+1) + DD * S3 CA(i+1, j+1) = CA1(i+1, j+1) + DD * S4 enddo enddo The loop body executes 12 reads, whereas as the rolled loop shown in the previous section executes 20 reads to compute the same four values. In code 3, two loops are unrolled 8 times and one loop is unrolled 4 times. Here is the before for (k = 0; k < NK[u]; k++) { sum = 0.0; for (y = 0; y < NY; y++) { sum += W[y][u][k] * delta[y]; } backprop[i++]=sum; } and after code for (k = 0; k < KK - 8; k+=8) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (y = 0; y < NY; y++) { sum0 += W[y][0][k+0] * delta[y]; sum1 += W[y][0][k+1] * delta[y]; sum2 += W[y][0][k+2] * delta[y]; sum3 += W[y][0][k+3] * delta[y]; sum4 += W[y][0][k+4] * delta[y]; sum5 += W[y][0][k+5] * delta[y]; sum6 += W[y][0][k+6] * delta[y]; sum7 += W[y][0][k+7] * delta[y]; } backprop[k+0] = sum0; backprop[k+1] = sum1; backprop[k+2] = sum2; backprop[k+3] = sum3; backprop[k+4] = sum4; backprop[k+5] = sum5; backprop[k+6] = sum6; backprop[k+7] = sum7; } for one of the loops unrolled 8 times. Optimizing for temporal locality is the most difficult optimization considered in this paper. The concepts are not difficult, but the sweet spot is small. Identifying where the program can benefit from loop unrolling or loop fusion is not trivial. Moreover, it takes some effort to get it right. Still, educating scientific programmers about temporal locality and teaching them how to optimize for it will pay dividends. Reducing instruction count Execution time is a function of instruction count. Reduce the count and you usually reduce the time. The best solution is to use a more efficient algorithm; that is, an algorithm whose order of complexity is smaller, that converges quicker, or is more accurate. Optimizing source code without changing the algorithm yields smaller, but still significant, gains. This paper considers only the latter because the intent is to study how much better codes can run if written by programmers schooled in basic code optimization techniques. The ten codes studied benefited from three types of "instruction reducing" optimizations. The two most prevalent were hoisting invariant memory and data operations out of inner loops. The third was eliminating unnecessary data copying. The nature of these inefficiencies is language dependent. Memory operations The semantics of C make it difficult for the compiler to determine all the invariant memory operations in a loop. The problem is particularly acute for loops in functions since the compiler may not know the values of the function's parameters at every call site when compiling the function. Most compilers support pragmas to help resolve ambiguities; however, these pragmas are not comprehensive and there is no standard syntax. To guarantee that invariant memory operations are not executed repetitively, the user has little choice but to hoist the operations by hand. The problem is not as severe in Fortran programs because in the absence of equivalence statements, it is a violation of the language's semantics for two names to share memory. Codes 3 and 5 are C programs. In both cases, the compiler did not hoist all invariant memory operations from inner loops. Consider the following loop from code 3 for (y = 0; y < NY; y++) { i = 0; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += delta[y] * I1[i++]; } } } Since dW[y][u] can point to the same memory space as delta for one or more values of y and u, assignment to dW[y][u][k] may change the value of delta[y]. In reality, dW and delta do not overlap in memory, so I rewrote the loop as for (y = 0; y < NY; y++) { i = 0; Dy = delta[y]; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += Dy * I1[i++]; } } } Failure to hoist invariant memory operations may be due to complex address calculations. If the compiler can not determine that the address calculation is invariant, then it can hoist neither the calculation nor the associated memory operations. As noted above, code 5 uses a macro to address four-dimensional arrays #define MAT4D(a,q,i,j,k) (double *)((a)->data + (q)*(a)->strides[0] + (i)*(a)->strides[3] + (j)*(a)->strides[2] + (k)*(a)->strides[1]) The macro is too complex for the compiler to understand and so, it does not identify any subexpressions as loop invariant. The simplest way to eliminate the address calculation from the innermost loop (over i) is to define a0 = MAT4D(a,q,0,j,k) before the loop and then replace all instances of *MAT4D(a,q,i,j,k) in the loop with a0[i] A similar problem appears in code 6, a Fortran program. The key loop in this program is do n1 = 1, nh nx1 = (n1 - 1) / nz + 1 nz1 = n1 - nz * (nx1 - 1) do n2 = 1, nh nx2 = (n2 - 1) / nz + 1 nz2 = n2 - nz * (nx2 - 1) ndx = nx2 - nx1 ndy = nz2 - nz1 gxx = grn(1,ndx,ndy) gyy = grn(2,ndx,ndy) gxy = grn(3,ndx,ndy) balance(n1,1) = balance(n1,1) + (force(n2,1) * gxx + force(n2,2) * gxy) * h1 balance(n1,2) = balance(n1,2) + (force(n2,1) * gxy + force(n2,2) * gyy)*h1 end do end do The programmer has written this loop well—there are no loop invariant operations with respect to n1 and n2. However, the loop resides within an iterative loop over time and the index calculations are independent with respect to time. Trading space for time, I precomputed the index values prior to the entering the time loop and stored the values in two arrays. I then replaced the index calculations with reads of the arrays. Data operations Ways to reduce data operations can appear in many forms. Implementing a more efficient algorithm produces the biggest gains. The closest I came to an algorithm change was in code 4. This code computes the inner product of K-vectors A(i) and B(j), 0 = i < N, 0 = j < M, for most values of i and j. Since the program computes most of the NM possible inner products, it is more efficient to compute all the inner products in one triply-nested loop rather than one at a time when needed. The savings accrue from reading A(i) once for all B(j) vectors and from loop unrolling. for (i = 0; i < N; i+=8) { for (j = 0; j < M; j++) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (k = 0; k < K; k++) { sum0 += A[i+0][k] * B[j][k]; sum1 += A[i+1][k] * B[j][k]; sum2 += A[i+2][k] * B[j][k]; sum3 += A[i+3][k] * B[j][k]; sum4 += A[i+4][k] * B[j][k]; sum5 += A[i+5][k] * B[j][k]; sum6 += A[i+6][k] * B[j][k]; sum7 += A[i+7][k] * B[j][k]; } C[i+0][j] = sum0; C[i+1][j] = sum1; C[i+2][j] = sum2; C[i+3][j] = sum3; C[i+4][j] = sum4; C[i+5][j] = sum5; C[i+6][j] = sum6; C[i+7][j] = sum7; }} This change requires knowledge of a typical run; i.e., that most inner products are computed. The reasons for the change, however, derive from basic optimization concepts. It is the type of change easily made at development time by a knowledgeable programmer. In code 5, we have the data version of the index optimization in code 6. Here a very expensive computation is a function of the loop indices and so cannot be hoisted out of the loop; however, the computation is invariant with respect to an outer iterative loop over time. We can compute its value for each iteration of the computation loop prior to entering the time loop and save the values in an array. The increase in memory required to store the values is small in comparison to the large savings in time. The main loop in Code 8 is doubly nested. The inner loop includes a series of guarded computations; some are a function of the inner loop index but not the outer loop index while others are a function of the outer loop index but not the inner loop index for (j = 0; j < N; j++) { for (i = 0; i < M; i++) { r = i * hrmax; R = A[j]; temp = (PRM[3] == 0.0) ? 1.0 : pow(r, PRM[3]); high = temp * kcoeff * B[j] * PRM[2] * PRM[4]; low = high * PRM[6] * PRM[6] / (1.0 + pow(PRM[4] * PRM[6], 2.0)); kap = (R > PRM[6]) ? high * R * R / (1.0 + pow(PRM[4]*r, 2.0) : low * pow(R/PRM[6], PRM[5]); < rest of loop omitted > }} Note that the value of temp is invariant to j. Thus, we can hoist the computation for temp out of the loop and save its values in an array. for (i = 0; i < M; i++) { r = i * hrmax; TEMP[i] = pow(r, PRM[3]); } [N.B. – the case for PRM[3] = 0 is omitted and will be reintroduced later.] We now hoist out of the inner loop the computations invariant to i. Since the conditional guarding the value of kap is invariant to i, it behooves us to hoist the computation out of the inner loop, thereby executing the guard once rather than M times. The final version of the code is for (j = 0; j < N; j++) { R = rig[j] / 1000.; tmp1 = kcoeff * par[2] * beta[j] * par[4]; tmp2 = 1.0 + (par[4] * par[4] * par[6] * par[6]); tmp3 = 1.0 + (par[4] * par[4] * R * R); tmp4 = par[6] * par[6] / tmp2; tmp5 = R * R / tmp3; tmp6 = pow(R / par[6], par[5]); if ((par[3] == 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp5; } else if ((par[3] == 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp4 * tmp6; } else if ((par[3] != 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp5; } else if ((par[3] != 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp4 * tmp6; } for (i = 0; i < M; i++) { kap = KAP[i]; r = i * hrmax; < rest of loop omitted > } } Maybe not the prettiest piece of code, but certainly much more efficient than the original loop, Copy operations Several programs unnecessarily copy data from one data structure to another. This problem occurs in both Fortran and C programs, although it manifests itself differently in the two languages. Code 1 declares two arrays—one for old values and one for new values. At the end of each iteration, the array of new values is copied to the array of old values to reset the data structures for the next iteration. This problem occurs in Fortran programs not included in this study and in both Fortran 77 and Fortran 90 code. Introducing pointers to the arrays and swapping pointer values is an obvious way to eliminate the copying; but pointers is not a feature that many Fortran programmers know well or are comfortable using. An easy solution not involving pointers is to extend the dimension of the value array by 1 and use the last dimension to differentiate between arrays at different times. For example, if the data space is N x N, declare the array (N, N, 2). Then store the problem’s initial values in (_, _, 2) and define the scalar names new = 2 and old = 1. At the start of each iteration, swap old and new to reset the arrays. The old–new copy problem did not appear in any C program. In programs that had new and old values, the code swapped pointers to reset data structures. Where unnecessary coping did occur is in structure assignment and parameter passing. Structures in C are handled much like scalars. Assignment causes the data space of the right-hand name to be copied to the data space of the left-hand name. Similarly, when a structure is passed to a function, the data space of the actual parameter is copied to the data space of the formal parameter. If the structure is large and the assignment or function call is in an inner loop, then copying costs can grow quite large. While none of the ten programs considered here manifested this problem, it did occur in programs not included in the study. A simple fix is always to refer to structures via pointers. Optimizing loop structures Since scientific programs spend almost all their time in loops, efficient loops are the key to good performance. Conditionals, function calls, little instruction level parallelism, and large numbers of temporary values make it difficult for the compiler to generate tightly packed, highly efficient code. Conditionals and function calls introduce jumps that disrupt code flow. Users should eliminate or isolate conditionls to their own loops as much as possible. Often logical expressions can be substituted for if-then-else statements. For example, code 2 includes the following snippet MaxDelta = 0.0 do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) if (Delta > MaxDelta) MaxDelta = Delta enddo enddo if (MaxDelta .gt. 0.001) goto 200 Since the only use of MaxDelta is to control the jump to 200 and all that matters is whether or not it is greater than 0.001, I made MaxDelta a boolean and rewrote the snippet as MaxDelta = .false. do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) MaxDelta = MaxDelta .or. (Delta .gt. 0.001) enddo enddo if (MaxDelta) goto 200 thereby, eliminating the conditional expression from the inner loop. A microprocessor can execute many instructions per instruction cycle. Typically, it can execute one or more memory, floating point, integer, and jump operations. To be executed simultaneously, the operations must be independent. Thick loops tend to have more instruction level parallelism than thin loops. Moreover, they reduce memory traffice by maximizing data reuse. Loop unrolling and loop fusion are two techniques to increase the size of loop bodies. Several of the codes studied benefitted from loop unrolling, but none benefitted from loop fusion. This observation is not too surpising since it is the general tendency of programmers to write thick loops. As loops become thicker, the number of temporary values grows, increasing register pressure. If registers spill, then memory traffic increases and code flow is disrupted. A thick loop with many temporary values may execute slower than an equivalent series of thin loops. The biggest gain will be achieved if the thick loop can be split into a series of independent loops eliminating the need to write and read temporary arrays. I found such an occasion in code 10 where I split the loop do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do into two disjoint loops do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) end do end do do i = 1, n do j = 1, m C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do Conclusions Over the course of the last year, I have had the opportunity to work with over two dozen academic scientific programmers at leading research universities. Their research interests span a broad range of scientific fields. Except for two programs that relied almost exclusively on library routines (matrix multiply and fast Fourier transform), I was able to improve significantly the single processor performance of all codes. Improvements range from 2x to 15.5x with a simple average of 4.75x. Changes to the source code were at a very high level. I did not use sophisticated techniques or programming tools to discover inefficiencies or effect the changes. Only one code was parallel despite the availability of parallel systems to all developers. Clearly, we have a problem—personal scientific research codes are highly inefficient and not running parallel. The developers are unaware of simple optimization techniques to make programs run faster. They lack education in the art of code optimization and parallel programming. I do not believe we can fix the problem by publishing additional books or training manuals. To date, the developers in questions have not studied the books or manual available, and are unlikely to do so in the future. Short courses are a possible solution, but I believe they are too concentrated to be much use. The general concepts can be taught in a three or four day course, but that is not enough time for students to practice what they learn and acquire the experience to apply and extend the concepts to their codes. Practice is the key to becoming proficient at optimization. I recommend that graduate students be required to take a semester length course in optimization and parallel programming. We would never give someone access to state-of-the-art scientific equipment costing hundreds of thousands of dollars without first requiring them to demonstrate that they know how to use the equipment. Yet the criterion for time on state-of-the-art supercomputers is at most an interesting project. Requestors are never asked to demonstrate that they know how to use the system, or can use the system effectively. A semester course would teach them the required skills. Government agencies that fund academic scientific research pay for most of the computer systems supporting scientific research as well as the development of most personal scientific codes. These agencies should require graduate schools to offer a course in optimization and parallel programming as a requirement for funding. About the Author John Feo received his Ph.D. in Computer Science from The University of Texas at Austin in 1986. After graduate school, Dr. Feo worked at Lawrence Livermore National Laboratory where he was the Group Leader of the Computer Research Group and principal investigator of the Sisal Language Project. In 1997, Dr. Feo joined Tera Computer Company where he was project manager for the MTA, and oversaw the programming and evaluation of the MTA at the San Diego Supercomputer Center. In 2000, Dr. Feo joined Sun Microsystems as an HPC application specialist. He works with university research groups to optimize and parallelize scientific codes. Dr. Feo has published over two dozen research articles in the areas of parallel parallel programming, parallel programming languages, and application performance.

    Read the article

  • Authorizing a computer to access a web application

    - by HackedByChinese
    I have a web application, and am tasked with adding secure sign-on to bolster security, akin to what Google has added to Google accounts. Use Case Essentially, when a user logs in, we want to detect if the user has previously authorized this computer. If the computer has not been authorized, the user is sent a one-time password (via email, SMS, or phone call) that they must enter, where the user may choose to remember this computer. In the web application, we will track authorized devices, allowing users to see when/where they logged in from that device last, and deauthorize any devices if they so choose. We require a solution that is very light touch (meaning, requiring no client-side software installation), and works with Safari, Chrome, Firefox, and IE 7+ (unfortunately). We will offer x509 security, which provides adequate security, but we still need a solution for customers that can't or won't use x509. My intention is to store authorization information using cookies (or, potentially, using local storage, degrading to flash cookies, and then normal cookies). At First Blush Track two separate values (local data or cookies): a hash representing a secure sign-on token, as well as a device token. Both values are driven (and recorded) by the web application, and dictated to the client. The SSO token is dependent on the device as well as a sequence number. This effectively allows devices to be deauthorized (all SSO tokens become invalid) and mitigates replay (not effectively, though, which is why I'm asking this question) through the use of a sequence number, and uses a nonce. Problem With this solution, it's possible for someone to just copy the SSO and device tokens and use in another request. While the sequence number will help me detect such an abuse and thus deauthorize the device, the detection and response can only happen after the valid device and malicious request both attempt access, which is ample time for damage to be done. I feel like using HMAC would be better. Track the device, the sequence, create a nonce, timestamp, and hash with a private key, then send the hash plus those values as plain text. Server does the same (in addition to validating the device and sequence) and compares. That seems much easier, and much more reliable.... assuming we can securely negotiate, exchange, and store private keys. Question So then, how can I securely negotiate a private key for authorized device, and then securely store that key? Is it more possible, at least, if I settle for storing the private key using local storage or flash cookies and just say it's "good enough"? Or, is there something I can do to my original draft to mitigate the vulnerability I describe?

    Read the article

  • Outputting a bar chart to an iPhone application?

    - by Moddy
    Right, I really want to output a Bar Graph to an Obj-C iPhone application - now I may be missing a vital SDK class or something; but right now my solution is to embed a WebView and inside that have a JQuery/Flot based graph - not totally ideal I know! Just wondering if anybody has any other creative solutions or whether a WebView/AJAX solution is the way to go? (For the record; my data source will be figures returned from an external source - i.e downloaded from the internet. So I was even toying with having a PHP Proxy/script do all the work on that, return the figures AND a graph to the application - but then I risk placing extra strain on the server!)

    Read the article

  • How to get default Ctrl+Tab functionality in WinForms MDI app when hosting WPF UserControls

    - by jpierson
    I have a WinForms based app with traditional MDI implementation within it except that I'm hosting WPF based UserControls via the ElementHost control as the main content for each of my MDI children. This is the solution recommended by Microsoft for achieving MDI with WPF although there are various side effects unfortunately. One of which is that my Ctrl+Tab functionality for tab switching between each MDI child is gone because the tab key seems to be swallowed up by the WPF controls. Is there a simple solution to this that will let the Ctrl+tab key sequences reach my WinForms MDI parent so that I can get the built-in tab switching functionality?

    Read the article

  • Trying to read FormsAuthentication tickets to read in other areas of site

    - by Pasha Aryana
    Hi, NOTE: I have included 3 links in here to my localhost areas but could not submit the post so I seperetaed them with a space character so it would post on stackoverflow. I currently have 2 ASP.NET MVC apps in my solution. First I run the first one by setting it to be startup project. It goes to the login page, from there once the data has been entered I execute the following code: var authTicket = new FormsAuthenticationTicket(1, login.LoginDataContract.MSISDN, DateTime.Now, DateTime.Now.AddMinutes(Convert.ToDouble("30")), true, ""); string cookieContents = FormsAuthentication.Encrypt(authTicket); var cookie = new HttpCookie(FormsAuthentication.FormsCookieName, cookieContents) { Expires = authTicket.Expiration, //Path = FormsAuthentication.FormsCookiePath //Path = "http://localhost" Domain = "" }; if (System.Web.HttpContext.Current != null) { System.Web.HttpContext.Current.Response.Cookies.Add(cookie); } As you can see I have set the Domain = "", so theoretically speaking it should work on any thing under my http: //localhost. Then I have set the persist security of the cookie to true so I can access it from any where under localhost. The cookie writes fine and I get logged in and all godd for now. BTW the url for this login page is: http //localhost/MyAccount/Login Now then I stop the solution and set the other MVC apps to be the startup. Then I run it. The URL for the second site is: http: //localhost/WebActivations/ Here is the code in the other apps start controller: public class HomeController : Controller { public ActionResult Index() { ViewData["Message"] = "Welcome to ASP.NET MVC!"; // PASHA: Added code to read the authorization cookie set at // login in MyAccount *.sln for (int i = 0; i < System.Web.HttpContext.Current.Request.Cookies.Count;i++) { Response.Write(System.Web.HttpContext.Current.Request.Cookies[i].Name + " " + System.Web.HttpContext.Current.Request.Cookies[i].Value); } HttpCookie authorizationCookie = System.Web.HttpContext.Current.Request.Cookies[FormsAuthentication.FormsCookieName.ToString()]; // decrypt. FormsAuthenticationTicket authorizationForm = FormsAuthentication.Decrypt(authorizationCookie.Value); ViewData["Message"] = authorizationForm.UserData[0].ToString(); return View(); } public ActionResult About() { return View(); } The problem is in this Home controller when I run the solution it cannot read the authentication cookie, you see the loop there it does not find the .ASPXAUTH cookie. But once it crashes in Firefox I have a look in the Page Info and then security and Cookies and its there and its the same cookie. What am I doing wrong?

    Read the article

  • How do I debug a DLL from VS2008?

    - by GregH
    I have a program written in VB.Net (Visual Studio 2008) that uses a DLL written in Visual C++ by another developer. I'd like to be able to step in to the C++ code as my code makes calls to methods in the DLL. Since the DLL is it's own solution, I don't think it can be included in my solution/project. I tried putting the DLLs pdb file in the debug/bin directory with the rest of my build and pdb files. However, when I get to the point in stepping through my code, and it gets to the dll call, it just steps right over the dll code. Do I have to manually load symbols? Not sure what I'm doing wrong. Thanks.

    Read the article

  • "Go To Definition" in Visual Studio only brings up the Metadata

    - by pfunk
    I am working in a Web Project in Visual Studio 2008. When I hit F12 (or right-click and select Go To Definition) Visual Studio is consistently going to the Meta data file instead of going to the source. Some Points: All the source code is C#, there is no VB.Net All the projects are in the same solution Yes, everything is a project reference (checked and double-checked) I have tried the Clean/Rebuild Solution approach (even to the point of clearing out the Temp directory, Temporary ASP.NET Files directory, etc). Has anyone else seen this behavior and/or know how to fix it?

    Read the article

  • Multi-threading mechanisms to run some lengthy operations from winforms code and communication with

    - by tmarouda
    What do I want to achieve: I want to perform some time consuming operations from my MDI winforms application (C# - .NET). An MDI child form may create the thread with the operation, which may take long time (from 0.1 seconds, to even half hour) to complete. In the meantime I want the UI to respond to user actions, including manipulation of data in some other MDI child form. When the operation completes, the thread should notify the MDI child that the calculations are done, so that the MDI child can perform the post-processing. How can I achieve this: Should I use explicit threading (i.e., create explicit threads), thread pools? Or simply just propose your solution. Should I create foreground or background threads? And how does the thread communicates with the GUI, according the solution you propose? If you know of a working example that handles a similar situation, please make a note.

    Read the article

  • Controllling Drupal's active/active-trail with duplicate menu items

    - by Mark
    I'm developing a site that requires some duplication of links within the menu: Section A -- Introduction -- Testimonials Section B -- Introduction -- Testimonials Testimonials -- Section A -- Section B So 'Section A Testimonials' and 'Testimonials Section A' point to the same node. But regardless of which menu link people use, I want the person to be in Section A. The problem is that D6 doesn't like duplicate menu items, and it assigns the active and active-trail classes rather unpredictably. So my thought was to create a placeholder node for each item in the Testimonials menu, and then set the URL to something like "testimonials/redirect/section-a", and then use mod_rewrite to redirect over to "section-a/testimonials". With this solution, I will have no duplicate paths in the menu. I'm just hoping this doesn't somehow hurt my SEO. Does anyone know a better solution?

    Read the article

  • Can you do icon overlays using Java on Windows OS

    - by Saviz
    I would like to manupilate badges or icon overlays using Java on windows. Basically some files on the drive to have overlays depending what state those files are in. This should be visible through windows explorer. Something simillar to how DropBox does things. Is that possible? I've seen several articles on this topic but none of them use Java. They all seem to use C++ or C# or COM objects. I was looking for a Java solution for windows. Of course I'd like to have a Java solution on Mac's too. Not sure if this is possible but before I give up I thought I ask.

    Read the article

  • HTML table headers always visible at top of window when viewing a large table

    - by Craig McQueen
    I would like to be able to "tweak" an HTML table's presentation to add a single feature: when scrolling down through the page so that the table is on the screen but the header rows are off-screen, I would like the headers to remain visible at the top of the viewing area. This would be conceptually like the "freeze panes" feature in Excel. However, an HTML page might contain several tables in it and I only would want it to happen for the table that is currently in-view, only while it is in-view. Note: I've seen one solution where the table data area is made scrollable while the headers do not scroll. That's not the solution I'm looking for.

    Read the article

  • Will OpenGL give me any FPS improvement over CoreAnimation for scrolling a large image?

    - by Ben Roberts
    Hi, I'm considering re-writing the menu system of my iPhone app to use Open GL just to improve the smoothness of scrolling a big image (480x1900px) across the screen. I'm looking at doing this as a way to improve on using the method/solution as described here (http://stackoverflow.com/questions/1443140/smoother-uiview). This solution was a big improvement over the previous implementation but it's still not perfect and as this is the first thing the user will see I'd like it to be as flawless as possible. Will switching to OpenGL give me the sort of smooth scrolling I'm looking for? I've stayed clear of OpenGL until now as this is my first app and core animation has handled everything else I've thrown at it well enough, would be good to know if this alternative implementation is likely to work! thanks

    Read the article

  • Multiple client projects to one server project w/ Silverlight & RIA Services Beta

    - by Dale Halliwell
    The type or namespace name 'Resources' does not exist in the namespace 'MyWebProject.Web' (are you missing an assembly reference?) C:\Users\...\MySecondProject\Generated_Code\MyWebProject.Web.g.cs I am having some problems trying to add a second SL client project to my (Ria services) SL Business Application. It has to do with the way the shared Resources files on the Web project are linked to from my new SL client project (the SL client project that was generated by the Business App template works fine). The same problem was brought up in the SL forums but copying the Web folder from my existing SL client doesn't seem to work. How can I add a second SL client project using RIA services to the solution of an existing SL Business Application without these problems over shared resources? Should I avoid the Business Application solution template for solutions with multiple SL clients since it seems to presume only a single client app will be sharing the resource files?

    Read the article

  • XML to be validated against multiple xsd schemas

    - by Michael Rusch
    I'm writing the xsd and the code to validate, so I have great control here. I would like to have an upload facility that adds stuff to my application based on an xml file. One part of the xml file should be validated against different schemas based on one of the values in the other part of it. Here's an example to illustrate: <foo> <name>Harold</name> <bar>Alpha</bar> <baz>Mercury</baz> <!-- ... more general info that applies to all foos ... --> <bar-config> <!-- the content here is specific to the bar named "Alpha" --> </bar-config> <baz-config> <!-- the content here is specific to the baz named "Mercury" --> </baz> </foo> In this case, there is some controlled vocabulary for the content of <bar>, and I can handle that part just fine. Then, based on the bar value, the appropriate xml schema should be used to validate the content of bar-config. Similarly for baz and baz-config. The code doing the parsing/validation is written in Java. Not sure how language-dependent the solution will be. Ideally, the solution would permit the xml author to declare the appropriate schema locations and what-not so that s/he could get the xml validated on the fly in a sufficiently smart editor. Also, the possible values for <bar> and <baz> are orthogonal, so I don't want to do this by extension for every possible bar/baz combo. What I mean is, if there are 24 possible bar values/schemas and 8 possible baz values/schemas, I want to be able to write 1 + 24 + 8 = 33 total schemas, instead of 1 * 24 * 8 = 192 total schemas. Also, I'd prefer to NOT break out the bar-config and baz-config into separate xml files if possible. I realize that might make all the problems much easier, as each xml file would have a single schema, but I'm trying to see if there is a good single-xml-file solution.

    Read the article

  • Problem building with MSBuild on Team Build

    - by mrwayne
    Hi, I have recently upgraded from TFS2005 to TFS2010 (and sub-sequently the team build server). I used to be able to get a team build on one of my solutions to work pretty easily, (see structure below) Solution |_Web Site | |_Bin | |_Other Files |_Project 1 |_Project 2 |_Project (n) Now, i can no longer get a build working correctly as it doesnt appear to build all my projects any longer (i've had to create a new build definition). Either that, or its not building the projects in such an order that when it hits project X, that a project it depends on (Project A), has not yet been built, and as such fails. I'm just basically trying to build a web site (not web application project), with some Dependant / linked projects. Why must it be so hard! Everything builds fine in the IDE. If i even open the solution copied to the build server under the 'Sources' directory, i am able to build it fine in the IDE on that server. No such luck with MSBuild though. Thoughts?

    Read the article

  • Hudson leaving open sessions

    - by James Carr
    Does anyone have any experiences with Hudson leaving sessions open to a Subversion server? We've been increasing our job list and got ~50 which poll the SCM regularly. It's been working fine but recently our SCM has started acting up by refusing handshakes, which we suspect is down to the sessions left open by Hudson. Last count there were ~400 sessions with nothing building on Hudson. At the moment the only solution we've found is restarting the Subversion service but this is becoming increasingly frequent and not a long term solution. Any experiences/ideas would be appreciated.

    Read the article

  • Thread placement policies on NUMA systems - update

    - by Dave
    In a prior blog entry I noted that Solaris used a "maximum dispersal" placement policy to assign nascent threads to their initial processors. The general idea is that threads should be placed as far away from each other as possible in the resource topology in order to reduce resource contention between concurrently running threads. This policy assumes that resource contention -- pipelines, memory channel contention, destructive interference in the shared caches, etc -- will likely outweigh (a) any potential communication benefits we might achieve by packing our threads more densely onto a subset of the NUMA nodes, and (b) benefits of NUMA affinity between memory allocated by one thread and accessed by other threads. We want our threads spread widely over the system and not packed together. Conceptually, when placing a new thread, the kernel picks the least loaded node NUMA node (the node with lowest aggregate load average), and then the least loaded core on that node, etc. Furthermore, the kernel places threads onto resources -- sockets, cores, pipelines, etc -- without regard to the thread's process membership. That is, initial placement is process-agnostic. Keep reading, though. This description is incorrect. On Solaris 10 on a SPARC T5440 with 4 x T2+ NUMA nodes, if the system is otherwise unloaded and we launch a process that creates 20 compute-bound concurrent threads, then typically we'll see a perfect balance with 5 threads on each node. We see similar behavior on an 8-node x86 x4800 system, where each node has 8 cores and each core is 2-way hyperthreaded. So far so good; this behavior seems in agreement with the policy I described in the 1st paragraph. I recently tried the same experiment on a 4-node T4-4 running Solaris 11. Both the T5440 and T4-4 are 4-node systems that expose 256 logical thread contexts. To my surprise, all 20 threads were placed onto just one NUMA node while the other 3 nodes remained completely idle. I checked the usual suspects such as processor sets inadvertently left around by colleagues, processors left offline, and power management policies, but the system was configured normally. I then launched multiple concurrent instances of the process, and, interestingly, all the threads from the 1st process landed on one node, all the threads from the 2nd process landed on another node, and so on. This happened even if I interleaved thread creating between the processes, so I was relatively sure the effect didn't related to thread creation time, but rather that placement was a function of process membership. I this point I consulted the Solaris sources and talked with folks in the Solaris group. The new Solaris 11 behavior is intentional. The kernel is no longer using a simple maximum dispersal policy, and thread placement is process membership-aware. Now, even if other nodes are completely unloaded, the kernel will still try to pack new threads onto the home lgroup (socket) of the primordial thread until the load average of that node reaches 50%, after which it will pick the next least loaded node as the process's new favorite node for placement. On the T4-4 we have 64 logical thread contexts (strands) per socket (lgroup), so if we launch 48 concurrent threads we will find 32 placed on one node and 16 on some other node. If we launch 64 threads we'll find 32 and 32. That means we can end up with our threads clustered on a small subset of the nodes in a way that's quite different that what we've seen on Solaris 10. So we have a policy that allows process-aware packing but reverts to spreading threads onto other nodes if a node becomes too saturated. It turns out this policy was enabled in Solaris 10, but certain bugs suppressed the mixed packing/spreading behavior. There are configuration variables in /etc/system that allow us to dial the affinity between nascent threads and their primordial thread up and down: see lgrp_expand_proc_thresh, specifically. In the OpenSolaris source code the key routine is mpo_update_tunables(). This method reads the /etc/system variables and sets up some global variables that will subsequently be used by the dispatcher, which calls lgrp_choose() in lgrp.c to place nascent threads. Lgrp_expand_proc_thresh controls how loaded an lgroup must be before we'll consider homing a process's threads to another lgroup. Tune this value lower to have it spread your process's threads out more. To recap, the 'new' policy is as follows. Threads from the same process are packed onto a subset of the strands of a socket (50% for T-series). Once that socket reaches the 50% threshold the kernel then picks another preferred socket for that process. Threads from unrelated processes are spread across sockets. More precisely, different processes may have different preferred sockets (lgroups). Beware that I've simplified and elided details for the purposes of explication. The truth is in the code. Remarks: It's worth noting that initial thread placement is just that. If there's a gross imbalance between the load on different nodes then the kernel will migrate threads to achieve a better and more even distribution over the set of available nodes. Once a thread runs and gains some affinity for a node, however, it becomes "stickier" under the assumption that the thread has residual cache residency on that node, and that memory allocated by that thread resides on that node given the default "first-touch" page-level NUMA allocation policy. Exactly how the various policies interact and which have precedence under what circumstances could the topic of a future blog entry. The scheduler is work-conserving. The x4800 mentioned above is an interesting system. Each of the 8 sockets houses an Intel 7500-series processor. Each processor has 3 coherent QPI links and the system is arranged as a glueless 8-socket twisted ladder "mobius" topology. Nodes are either 1 or 2 hops distant over the QPI links. As an aside the mapping of logical CPUIDs to physical resources is rather interesting on Solaris/x4800. On SPARC/Solaris the CPUID layout is strictly geographic, with the highest order bits identifying the socket, the next lower bits identifying the core within that socket, following by the pipeline (if present) and finally the logical thread context ("strand") on the core. But on Solaris on the x4800 the CPUID layout is as follows. [6:6] identifies the hyperthread on a core; bits [5:3] identify the socket, or package in Intel terminology; bits [2:0] identify the core within a socket. Such low-level details should be of interest only if you're binding threads -- a bad idea, the kernel typically handles placement best -- or if you're writing NUMA-aware code that's aware of the ambient placement and makes decisions accordingly. Solaris introduced the so-called critical-threads mechanism, which is expressed by putting a thread into the FX scheduling class at priority 60. The critical-threads mechanism applies to placement on cores, not on sockets, however. That is, it's an intra-socket policy, not an inter-socket policy. Solaris 11 introduces the Power Aware Dispatcher (PAD) which packs threads instead of spreading them out in an attempt to be able to keep sockets or cores at lower power levels. Maximum dispersal may be good for performance but is anathema to power management. PAD is off by default, but power management polices constitute yet another confounding factor with respect to scheduling and dispatching. If your threads communicate heavily -- one thread reads cache lines last written by some other thread -- then the new dense packing policy may improve performance by reducing traffic on the coherent interconnect. On the other hand if your threads in your process communicate rarely, then it's possible the new packing policy might result on contention on shared computing resources. Unfortunately there's no simple litmus test that says whether packing or spreading is optimal in a given situation. The answer varies by system load, application, number of threads, and platform hardware characteristics. Currently we don't have the necessary tools and sensoria to decide at runtime, so we're reduced to an empirical approach where we run trials and try to decide on a placement policy. The situation is quite frustrating. Relatedly, it's often hard to determine just the right level of concurrency to optimize throughput. (Understanding constructive vs destructive interference in the shared caches would be a good start. We could augment the lines with a small tag field indicating which strand last installed or accessed a line. Given that, we could augment the CPU with performance counters for misses where a thread evicts a line it installed vs misses where a thread displaces a line installed by some other thread.)

    Read the article

< Previous Page | 171 172 173 174 175 176 177 178 179 180 181 182  | Next Page >