Search Results

Search found 14169 results on 567 pages for 'parallel programming'.

Page 277/567 | < Previous Page | 273 274 275 276 277 278 279 280 281 282 283 284  | Next Page >

  • GPU Debugging with VS 11

    - by Daniel Moth
    With VS 11 Developer Preview we have invested tremendously in parallel debugging for both CPU (managed and native) and GPU debugging. I'll be doing a whole bunch of blog posts on those topics, and in this post I just wanted to get people started with GPU debugging, i.e. with debugging C++ AMP code. First I invite you to watch 6 minutes of a glimpse of the C++ AMP debugging experience though this video (ffw to minute 51:54, up until minute 59:16). Don't read the rest of this post, just go watch that video, ideally download the High Quality WMV. Summary GPU debugging essentially means debugging the lambda that you pass to the parallel_for_each call (plus any functions you call from the lambda, of course). CPU debugging means debugging all the code above and below the parallel_for_each call, i.e. all the code except the restrict(direct3d) lambda and the functions that it calls. With VS 11 you have to choose what debugger you want to use for a particular debugging session, CPU or GPU. So you can place breakpoints all over your code, then choose what debugger you want (CPU or GPU), and you'll only be able to hit breakpoints for the code type that the debugger engine understands – the remaining breakpoints will appear as unbound. If you want to hit the unbound breakpoints, you'd have to stop debugging, and start again with the other debugger. Sorry. We suck. We know. But once you are past that limitation, I think you'll find the experience truly rewarding – seriously! Switching debugger engines With the Developer Preview bits, one way to switch the debugger engine is through the project properties – see the screenshots that follow. This one is showing the CPU option selected, which is basically the default that you are all familiar with: This screenshot is showing the GPU option selected, by changing the debugger launcher (notice that this applies for both the local and remote case): You actually do not have to open the project properties just for switching the debugger engine, you can switch the selection from the toolbar in VS 11 Developer Preview too – see following screenshot (the effect is the same as if you opened the project properties and switched there) Breakpoint behavior Here are two screenshots, one showing a debugging session for CPU and the other a debugging session for GPU (notice the unbound breakpoints in each case) …and here is the GPU case (where we cannot bind the CPU breakpoints but can the GPU breakpoint, which is actually hit) Give C++ AMP debugging a try So to debug your C++ AMP code, pull down the drop down under the 'play' button to select the 'GPU C++ Direct3D Compute Debugger' menu option, then hit F5 (or the 'play' button itself). Then you can explore debugging by exploring the menus under the Debug and under the Debug->Windows menus. One way to do that exploration is through the C++ AMP debugging walkthrough on MSDN. Another way to explore the C++ AMP debugging experience, you can use the moth.cpp code file, which is what I used in my BUILD session debugger demo. Note that for my demo I was using the latest internal VS11 bits, so your experience with the Developer Preview bits won't be identical to what you saw me demonstrate, but it shouldn't be far off. Stay tuned for a lot more content on the parallel debugger in VS 11, both CPU and GPU, both managed and native. Comments about this post by Daniel Moth welcome at the original blog.

    Read the article

  • To ORM or Not to ORM. That is the question&hellip;

    - by Patrick Liekhus
    UPDATE:  Thanks for the feedback and comments.  I have adjusted my table below with your recommendations.  I had missed a point or two. I wanted to do a series on creating an entire project using the EDMX XAF code generation and the SpecFlow BDD Easy Test tools discussed in my earlier posts, but I thought it would be appropriate to start with a simple comparison and reasoning on why I choose to use these tools. Let’s start by defining the term ORM, or Object-Relational Mapping.  According to Wikipedia it is defined as the following: Object-relational mapping (ORM, O/RM, and O/R mapping) in computer software is a programming technique for converting data between incompatible type systems in object-oriented programming languages. This creates, in effect, a "virtual object database" that can be used from within the programming language. Why should you care?  Basically it allows you to map your business objects in code to their persistence layer behind them. And better yet, why would you want to do this?  Let me outline it in the following points: Development speed.  No more need to map repetitive tasks query results to object members.  Once the map is created the code is rendered for you. Persistence portability.  The ORM knows how to map SQL specific syntax for the persistence engine you choose.  It does not matter if it is SQL Server, Oracle and another database of your choosing. Standard/Boilerplate code is simplified.  The basic CRUD operations are consistent and case use database metadata for basic operations. So how does this help?  Well, let’s compare some of the ORM tools that I have used and/or researched.  I have been interested in ORM for some time now.  My ORM of choice for a long time was NHibernate and I still believe it has a strong case in some business situations.  However, you have to take business considerations into account and the law of diminishing returns.  Because of these two factors, my recent activity and experience has been around DevExpress eXpress Persistence Objects (XPO).  The primary reason for this is because they have the DevExpress eXpress Application Framework (XAF) that sits on top of XPO.  With this added value, the data model can be created (either database first of code first) and the Web and Windows client can be created from these maps.  While out of the box they provide some simple list and detail screens, you can verify easily extend and modify these to your liking.  DevExpress has done a tremendous job of providing enough framework while also staying out of the way when you need to extend it.  This sounds worse than it really is.  What I mean by this is that if you choose to follow DevExpress coding style and recommendations, the hooks and extension points provided allow you to do some pretty heavy lifting while also not worrying about the basics. I have put together a list of the top features that I have used to compare the limited list of ORM’s that I have exposure with.  Again, the biggest selling point in my opinion is that XPO is just a solid as any of the other ORM’s but with the added layer of XAF they become unstoppable.  And then couple that with the EDMX modeling tools and code generation, it becomes a no brainer. Designer Features Entity Framework NHibernate Fluent w/ Nhibernate Telerik OpenAccess DevExpress XPO DevExpress XPO/XAF plus Liekhus Tools Uses XML to map relationships - Yes - - -   Visual class designer interface Yes - - - - Yes Management integrated w/ Visual Studio Yes - - Yes - Yes Supports schema first approach Yes - - Yes - Yes Supports model first approach Yes - - Yes Yes Yes Supports code first approach Yes Yes Yes Yes Yes Yes Attribute driven coding style Yes - Yes - Yes Yes                 I have a very small team and limited resources with a lot of responsibilities.  In order to keep up with our customers, we must rely on tools like these.  We use the EDMX tool so that we can create a visual representation of the applications with our customers.  Second, we rely on the code generation so that we can focus on the business problems at hand and not whether a field is mapped correctly.  This keeps us from requiring as many junior level developers on our team.  I have also worked on multiple teams where they believed in writing their own “framework”.  In my experiences and opinion this is not the route to take unless you have a team dedicated to supporting just the framework.  Each time that I have worked on custom frameworks, the framework eventually becomes old, out dated and full of “performance” enhancements specific to one or two requirements.  With an ORM, there are a lot smarter people than me working on the bigger issue of persistence and performance.  Again, my recommendation would be to use an available framework and get to working on your business domain problems.  If your coding is not making money for you, why are you working on it?  Do you really need to be writing query to object member code again and again? Thanks

    Read the article

  • Parallelism in .NET – Part 17, Think Continuations, not Callbacks

    - by Reed
    In traditional asynchronous programming, we’d often use a callback to handle notification of a background task’s completion.  The Task class in the Task Parallel Library introduces a cleaner alternative to the traditional callback: continuation tasks. Asynchronous programming methods typically required callback functions.  For example, MSDN’s Asynchronous Delegates Programming Sample shows a class that factorizes a number.  The original method in the example has the following signature: public static bool Factorize(int number, ref int primefactor1, ref int primefactor2) { //... .csharpcode, .csharpcode pre { font-size: small; color: black; font-family: consolas, "Courier New", courier, monospace; background-color: #ffffff; /*white-space: pre;*/ } .csharpcode pre { margin: 0em; } .csharpcode .rem { color: #008000; } .csharpcode .kwrd { color: #0000ff; } .csharpcode .str { color: #006080; } .csharpcode .op { color: #0000c0; } .csharpcode .preproc { color: #cc6633; } .csharpcode .asp { background-color: #ffff00; } .csharpcode .html { color: #800000; } .csharpcode .attr { color: #ff0000; } .csharpcode .alt { background-color: #f4f4f4; width: 100%; margin: 0em; } .csharpcode .lnum { color: #606060; } However, calling this is quite “tricky”, even if we modernize the sample to use lambda expressions via C# 3.0.  Normally, we could call this method like so: int primeFactor1 = 0; int primeFactor2 = 0; bool answer = Factorize(10298312, ref primeFactor1, ref primeFactor2); Console.WriteLine("{0}/{1} [Succeeded {2}]", primeFactor1, primeFactor2, answer); If we want to make this operation run in the background, and report to the console via a callback, things get tricker.  First, we need a delegate definition: public delegate bool AsyncFactorCaller( int number, ref int primefactor1, ref int primefactor2); Then we need to use BeginInvoke to run this method asynchronously: int primeFactor1 = 0; int primeFactor2 = 0; AsyncFactorCaller caller = new AsyncFactorCaller(Factorize); caller.BeginInvoke(10298312, ref primeFactor1, ref primeFactor2, result => { int factor1 = 0; int factor2 = 0; bool answer = caller.EndInvoke(ref factor1, ref factor2, result); Console.WriteLine("{0}/{1} [Succeeded {2}]", factor1, factor2, answer); }, null); This works, but is quite difficult to understand from a conceptual standpoint.  To combat this, the framework added the Event-based Asynchronous Pattern, but it isn’t much easier to understand or author. Using .NET 4’s new Task<T> class and a continuation, we can dramatically simplify the implementation of the above code, as well as make it much more understandable.  We do this via the Task.ContinueWith method.  This method will schedule a new Task upon completion of the original task, and provide the original Task (including its Result if it’s a Task<T>) as an argument.  Using Task, we can eliminate the delegate, and rewrite this code like so: var background = Task.Factory.StartNew( () => { int primeFactor1 = 0; int primeFactor2 = 0; bool result = Factorize(10298312, ref primeFactor1, ref primeFactor2); return new { Result = result, Factor1 = primeFactor1, Factor2 = primeFactor2 }; }); background.ContinueWith(task => Console.WriteLine("{0}/{1} [Succeeded {2}]", task.Result.Factor1, task.Result.Factor2, task.Result.Result)); This is much simpler to understand, in my opinion.  Here, we’re explicitly asking to start a new task, then continue the task with a resulting task.  In our case, our method used ref parameters (this was from the MSDN Sample), so there is a little bit of extra boiler plate involved, but the code is at least easy to understand. That being said, this isn’t dramatically shorter when compared with our C# 3 port of the MSDN code above.  However, if we were to extend our requirements a bit, we can start to see more advantages to the Task based approach.  For example, supposed we need to report the results in a user interface control instead of reporting it to the Console.  This would be a common operation, but now, we have to think about marshaling our calls back to the user interface.  This is probably going to require calling Control.Invoke or Dispatcher.Invoke within our callback, forcing us to specify a delegate within the delegate.  The maintainability and ease of understanding drops.  However, just as a standard Task can be created with a TaskScheduler that uses the UI synchronization context, so too can we continue a task with a specific context.  There are Task.ContinueWith method overloads which allow you to provide a TaskScheduler.  This means you can schedule the continuation to run on the UI thread, by simply doing: Task.Factory.StartNew( () => { int primeFactor1 = 0; int primeFactor2 = 0; bool result = Factorize(10298312, ref primeFactor1, ref primeFactor2); return new { Result = result, Factor1 = primeFactor1, Factor2 = primeFactor2 }; }).ContinueWith(task => textBox1.Text = string.Format("{0}/{1} [Succeeded {2}]", task.Result.Factor1, task.Result.Factor2, task.Result.Result), TaskScheduler.FromCurrentSynchronizationContext()); This is far more understandable than the alternative.  By using Task.ContinueWith in conjunction with TaskScheduler.FromCurrentSynchronizationContext(), we get a simple way to push any work onto a background thread, and update the user interface on the proper UI thread.  This technique works with Windows Presentation Foundation as well as Windows Forms, with no change in methodology.

    Read the article

  • When to implement: Together with or after the source product?

    - by Jeremy Oosthuizen
    Somebody recently relayed a prospect's question to me: How hard would it be to implement OUBI after the source product (CC&B, WAM or NMS) has already been implemented? Fact is that MOST non-OUBI Data Warehouse / Business Intelligence implementations take place after the source application(s) are in place and hopefully stable. If an organization decides that they need better reporting and management information, then the logical path (see The Data Warehouse Institute's Data Warehouse Maturity Model) is to a Data Warehouse -- no matter when their last applications were implemented. If there is a pre-built Data Warehouse for their specific application, or even for the desired business process in their industry, they're in luck. Else they have to design and build from scratch, using a toolset. The implementation of a toolset is unlike the implementation of OUBI which, like OBI Apps, contain pre-built ETL routines and user content. Much has been written before about the advantages of that. So, because OUBI is designed specifically for Oracle Utilities transactional products, we often implement them in parallel -- with OUBI lagging a little behind by necessity, like Reporting. Customers know from the start they're going to need the solution, and therefore purchase the products at the same time. My biggest argument FOR a parallel installation/implementation of OUBI with the source product is two-fold: - There could be things (which is the technical term for data elements) that customers figure out they need when implementing OUBI, which are often easier added to the source product's implementation project, than to add later; - OUBI's ETL often points out errors (severe or not) with converted data, which are easier to fix during the source product's implementation project, or it may even be impossible to fix afterwards. The Conversion routines sometimes miss these errors, because the source system can live with the not-quite-perfect converted data. If the data can't be properly extracted, i.e. the proper Dimensions linked to the Facts, then it can't get into OUBI. That means it can't be analyzed effectively along with the rest of the organization's data. Then there is also the throw-away-work argument, which may be significant. The operational / transactional system cannot go live without reports on Day 1. A lot of those reports would be taken care of by the implementation of OUBI. If OUBI is implemented after go-live, those reports STILL have to be built during the source product's implementation project, but they become throw-away after the OUBI implementation. I have sometimes been told that it is better to implement OUBI after the source product, because it cuts down on scope and risk for the source product's implementation project. All I can say to that, is bah humbug. No, seriously, given the arguments above, planning has to include the OUBI implementation and it has to be managed properly -- just like any other implementation. If so, it should not add any risk and it should be included in the scope from the start. The answer to the prospect's question is therefore that it is not that much more difficult; after all, most DW/BI implemenations are done like that. They just have to consider the points above.

    Read the article

  • Analysis Services (SSAS) - Unexpected Internal Error when processing (ProcessUpdate). Workaround/Resolution

    - by James Rogers
    Many implementations require the use of ProcessUpdate to support Type 1 slowly changing dimensions. ProcessUpdate drops all of the affected indexes and aggregations in partitions affected by data that changes in the Dimension on which the ProcessUpdate is being performed. Twice now I have had situations where the processing fails with "Internal error: An unexpected exception occurred." Any subsequent ProcessUpdate processing will also fail with the same error. In talking with Microsoft the issue is corrupt indexes for the Dimension(s) being processed in the partitions of the affected measure group. I cannot guarantee that the following will correct your problem but it did in my case and saved us quite a bit of down time.   Workaround: ProcessIndexes on the entire cube that is being processed and throwing the error. This corrected the problem on both 2008 and 2008 R2.   Pros:  Does not require a complete rebuild of the data (ProcessFull) for either the Dimension or Cube. User access can continue while this ProcessIndexes in underway.   Cons: Can take a long time, especially on large cubes with many partitions, dimensions and/or aggregations. Query Performance is usually severely impacted due to the memory and CPU requirements for Aggregation and Index building   <Batch http://schemas.microsoft.com/analysisservices/2003/engine"http://schemas.microsoft.com/analysisservices/2003/engine">  <Parallel>     <Process xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ddl2="http://schemas.microsoft.com/analysisservices/2003/engine/2" xmlns:ddl2_2="http://schemas.microsoft.com/analysisservices/2003/engine/2/2" xmlns:ddl100_100="http://schemas.microsoft.com/analysisservices/2008/engine/100/100" xmlns:ddl200="http://schemas.microsoft.com/analysisservices/2010/engine/200" xmlns:ddl200_200="http://schemas.microsoft.com/analysisservices/2010/engine/200/200">       <Object>         <DatabaseID>MyDatabase</DatabaseID>         <CubeID>MyCube</CubeID>       </Object>       <Type>ProcessIndexes</Type>       <WriteBackTableCreation>UseExisting</WriteBackTableCreation>     </Process>  </Parallel> </Batch>   The cube where the corruption exists can be found by having Profiler running while the ProcessUpdate is executing. The first partition that displays the "The Job has ended in failure." message in the TextData column will be part of the cube/measuregroup that has the corruption. You can try to run ProcessIndexes on just that measure group. This may correct the problem and save additional time if you have other large measure groups in the cube that are not affected by the corruption.   Remember to execute your normal ProcessUpdate batch after the successful completion of the ProcessIndexes. The ProcessIndexes does not pick up data changes.   Things that did not work: ProcessClearIndexes - why this doesn't work and ProcessIndexes does is unclear at this point. ProcessFull on the partition in question. In my latest case, this would clear up the problem for that partition. However, the next partition the ProcessUpdate touched that had data in it would generate and error. This leads me to believe the corruption problem will exist in all partitions in the affected measure group that have data in them.   NOTE: I experience this problem in both a SQL 2008 and SQL 2008 R2 Analysis Services environment, on separate built from the same relational database. This leads me to believe that some data condition in the tables used for the Dimension processing caused the corruption since the two environments were on physically separate hardware. I am waiting on Microsoft to analyze the dumps to give us more insight into what actually caused the corruption and will update this post accordingly.

    Read the article

  • Triangle Line-Segment Intersection - detecting near misses

    - by Will
    A ray is a very poor approximation of a player! I think approximating a player with a sphere traveling a straight line each game tick will solve my problems of the player intersecting edges of scenery because their line segment missed it yet their own model is not infinitely thin... I have a 3D triangle and a line segment. I have the normal triangle-line-segment intersection code which I admit I have only a woolly grasp of. To model movement and compute collisions of the player I have to determine if a line passes within sphere-radius of a triangle. But I can find no convenient line near-miss intersection code! Here's the classic triangle intersection ### commented ### code with my starting assumptions: function triangle_ray_intersection(a,b,c,ray_origin,ray_dir,ray_radius) { // http://softsurfer.com/Archive/algorithm_0105/algorithm_0105.htm#intersect_RayTriangle%28%29 // get triangle edge vectors and plane normal var u = vec3_sub(b,a); var v = vec3_sub(c,a); var n = vec3_cross(u,v); if(n[0]==0 && n[1]==0 && n[2]==0) return null; // triangle is degenerate var w0 = vec3_sub(ray_origin,a); var j = vec3_dot(n,ray_dir); if(Math.abs(j) < 0.00000001) { //### if parallel, might still pass within ray_radius of it return null; // parallel, disjoint or on plane } var i = -vec3_dot(n,w0); // get intersect point of ray with triangle plane var k = i / j; if(k < 0.0) return null; // ray goes away from triangle //### as its a line segment, k > 1+ray_radius means no intersect var hit = vec3_add(ray_origin,vec3_scale(ray_dir,k)); // intersect point of ray and plane // is I inside T? //### here I'm a bit lost; this is presumably computing barycentric coordinates? var uu = vec3_dot(u,u); var uv = vec3_dot(u,v); var vv = vec3_dot(v,v); var w = vec3_sub(hit,a); var wu = vec3_dot(w,u); var wv = vec3_dot(w,v); var D = uv * uv - uu * vv; var s = (uv * wv - vv * wu) / D; //### therefore, compute if its within ray_radius scaled to the 0..1 of barycentric coordinates? if(s<0.0 || s>1.0) return null; // I is outside T var t = (uv * wu - uu * wv) / D; if(t<0.0 || (s+t)>1.0) return null; // I is outside T //### finally, if it passses a barycentric test it might still be too far //### to a point; must check that its distance from a corner is within ray_radius too if more than one barycentric coord is >1 //### so we have rounded corners... return [hit,n]; // I is in T } Given the distance between the point of plane intersection and each corner, I ought to be able to determine distance at world scale of how far beyond the edge - beyond 1.0 in barycentric coordinates for each axis - that point is... At this point my head explodes! Is this the right track? What's the actual code? UPDATE: you can earn 100 pts on SO if you answer this question there...! How can you determine if a line segment passes within some distance of a triangle?

    Read the article

  • ODI 11g – Oracle Multi Table Insert

    - by David Allan
    With the IKM Oracle Multi Table Insert you can generate Oracle specific DML for inserting into multiple target tables from a single query result – without reprocessing the query or staging its result. When designing this to exploit the IKM you must split the problem into the reusable parts – the select part goes in one interface (I named SELECT_PART), then each target goes in a separate interface (INSERT_SPECIAL and INSERT_REGULAR). So for my statement below… /*INSERT_SPECIAL interface */ insert  all when 1=1 And (INCOME_LEVEL > 250000) then into SCOTT.CUSTOMERS_NEW (ID, NAME, GENDER, BIRTH_DATE, MARITAL_STATUS, INCOME_LEVEL, CREDIT_LIMIT, EMAIL, USER_CREATED, DATE_CREATED, USER_MODIFIED, DATE_MODIFIED) values (ID, NAME, GENDER, BIRTH_DATE, MARITAL_STATUS, INCOME_LEVEL, CREDIT_LIMIT, EMAIL, USER_CREATED, DATE_CREATED, USER_MODIFIED, DATE_MODIFIED) /* INSERT_REGULAR interface */ when 1=1  then into SCOTT.CUSTOMERS_SPECIAL (ID, NAME, GENDER, BIRTH_DATE, MARITAL_STATUS, INCOME_LEVEL, CREDIT_LIMIT, EMAIL, USER_CREATED, DATE_CREATED, USER_MODIFIED, DATE_MODIFIED) values (ID, NAME, GENDER, BIRTH_DATE, MARITAL_STATUS, INCOME_LEVEL, CREDIT_LIMIT, EMAIL, USER_CREATED, DATE_CREATED, USER_MODIFIED, DATE_MODIFIED) /*SELECT*PART interface */ select        CUSTOMERS.EMAIL EMAIL,     CUSTOMERS.CREDIT_LIMIT CREDIT_LIMIT,     UPPER(CUSTOMERS.NAME) NAME,     CUSTOMERS.USER_MODIFIED USER_MODIFIED,     CUSTOMERS.DATE_MODIFIED DATE_MODIFIED,     CUSTOMERS.BIRTH_DATE BIRTH_DATE,     CUSTOMERS.MARITAL_STATUS MARITAL_STATUS,     CUSTOMERS.ID ID,     CUSTOMERS.USER_CREATED USER_CREATED,     CUSTOMERS.GENDER GENDER,     CUSTOMERS.DATE_CREATED DATE_CREATED,     CUSTOMERS.INCOME_LEVEL INCOME_LEVEL from    SCOTT.CUSTOMERS   CUSTOMERS where    (1=1) Firstly I create a SELECT_PART temporary interface for the query to be reused and in the IKM assignment I state that it is defining the query, it is not a target and it should not be executed. Then in my INSERT_SPECIAL interface loading a target with a filter, I set define query to false, then set true for the target table and execute to false. This interface uses the SELECT_PART query definition interface as a source. Finally in my final interface loading another target I set define query to false again, set target table to true and execute to true – this is the go run it indicator! To coordinate the statement construction you will need to create a package with the select and insert statements. With 11g you can now execute the package in simulation mode and preview the generated code including the SQL statements. Hopefully this helps shed some light on how you can leverage the Oracle MTI statement. A similar IKM exists for Teradata. The ODI IKM Teradata Multi Statement supports this multi statement request in 11g, here is an extract from the paper at www.teradata.com/white-papers/born-to-be-parallel-eb3053/ Teradata Database offers an SQL extension called a Multi-Statement Request that allows several distinct SQL statements to be bundled together and sent to the optimizer as if they were one. Teradata Database will attempt to execute these SQL statements in parallel. When this feature is used, any sub-expressions that the different SQL statements have in common will be executed once, and the results shared among them. It works in the same way as the ODI MTI IKM, multiple interfaces orchestrated in a package, each interface contributes some SQL, the last interface in the chain executes the multi statement.

    Read the article

  • F# and the rose-tinted reflection

    - by CliveT
    We're already seeing increasing use of many cores on client desktops. It is a change that has been long predicted. It is not just a change in architecture, but our notions of efficiency in a program. No longer can we focus on the asymptotic complexity of an algorithm by counting the steps that a single core processor would take to execute it. Instead we'll soon be more concerned about the scalability of the algorithm and how well we can increase the performance as we increase the number of cores. This may even lead us to throw away our most efficient algorithms, and switch to less efficient algorithms that scale better. We might even be willing to waste cycles in order to speculatively execute at the algorithm rather than the hardware level. State is the big headache in this parallel world. At the hardware level, main memory doesn't necessarily contain the definitive value corresponding to a particular address. An update to a location might still be held in a CPU's local cache and it might be some time before the value gets propagated. To get the latest value, and the notion of "latest" takes a lot of defining in this world of rapidly mutating state, the CPUs may well need to communicate to decide who has the definitive value of a particular address in order to avoid lost updates. At the user program level, this means programmers will need to lock objects before modifying them, or attempt to avoid the overhead of locking by understanding the memory models at a very deep level. I think it's this need to avoid statefulness that has led to the recent resurgence of interest in functional languages. In the 1980s, functional languages started getting traction when research was carried out into how programs in such languages could be auto-parallelised. Sadly, the impracticality of some of the languages, the overheads of communication during this parallel execution, and rapid improvements in compiler technology on stock hardware meant that the functional languages fell by the wayside. The one thing that these languages were good at was getting rid of implicit state, and this single idea seems like a solution to the problems we are going to face in the coming years. Whether these languages will catch on is hard to predict. The mindset for writing a program in a functional language is really very different from the way that object-oriented problem decomposition happens - one has to focus on the verbs instead of the nouns, which takes some getting used to. There are a number of hybrid functional/object languages that have been becoming more popular in recent times. These half-way houses make it easy to use functional ideas for some parts of the program while still allowing access to the underlying object-focused platform without a great deal of impedance mismatch. One example is F# running on the CLR which, in Visual Studio 2010, has because a first class member of the pack. Inside Visual Studio 2010, the tooling for F# has improved to the point where it is easy to set breakpoints and watch values change while debugging at the source level. In my opinion, it is the tooling support that will enable the widespread adoption of functional languages - without this support, people will put off any transition into the functional world for as long as they possibly can. Without tool support it will make it hard to learn these languages. One tool that doesn't currently support F# is Reflector. The idea of decompiling IL to a functional language is daunting, but F# is potentially so important I couldn't dismiss the idea. As I'm currently developing Reflector 6.5, I thought it wise to take four days just to see how far I could get in doing so, even if it achieved little more than to be clearer on how much was possible, and how long it might take. You can read what happened here, and of the insights it gave us on ways to improve the tool.

    Read the article

  • Give a session on C++ AMP – here is how

    - by Daniel Moth
    Ever since presenting on C++ AMP at the AMD Fusion conference in June, then the Gamefest conference in August, and the BUILD conference in September, I've had numerous requests about my material from folks that want to re-deliver the same session. The C++ AMP session I put together has evolved over the 3 presentations to its final form that I used at BUILD, so that is the one I recommend you base yours on. Please get the slides and the recording from channel9 (I'll refer to slide numbers below). This is how I've been presenting the C++ AMP session: Context (slide 3, 04:18-08:18) Start with a demo, on my dual-GPU machine. I've been using the N-Body sample (for VS 11 Developer Preview). (slide 4) Use an nvidia slide that has additional examples of performance improvements that customers enjoy with heterogeneous computing. (slide 5) Talk a bit about the differences today between CPU and GPU hardware, leading to the fact that these will continue to co-exist and that GPUs are great for data parallel algorithms, but not much else today. One is a jack of all trades and the other is a number cruncher. (slide 6) Use the APU example from amd, as one indication that the hardware space is still in motion, emphasizing that the C++ AMP solution is a data parallel API, not a GPU API. It has a future proof design for hardware we have yet to see. (slide 7) Provide more meta-data, as blogged about when I first introduced C++ AMP. Code (slide 9-11) Introduce C++ AMP coding with a simplistic array-addition algorithm – the slides speak for themselves. (slide 12-13) index<N>, extent<N>, and grid<N>. (Slide 14-16) array<T,N>, array_view<T,N> and comparison between them. (Slide 17) parallel_for_each. (slide 18, 21) restrict. (slide 19-20) actual restrictions of restrict(direct3d) – the slides speak for themselves. (slide 22) bring it altogether with a matrix multiplication example. (slide 23-24) accelerator, and accelerator_view. (slide 26-29) Introduce tiling incl. tiled matrix multiplication [tiling probably deserves a whole session instead of 6 minutes!]. IDE (slide 34,37) Briefly touch on the concurrency visualizer. It supports GPU profiling, but enhancements specific to C++ AMP we hope will come at the Beta timeframe, which is when I'll be spending more time talking about it. (slide 35-36, 51:54-59:16) Demonstrate the GPU debugging experience in VS 11. Summary (slide 39) Re-iterate some of the points of slide 7, and add the point that the C++ AMP spec will be open for other compiler vendors to implement, even on other platforms (in fact, Microsoft is actively working on that). (slide 40) Links to content – see slide – including where all your questions should go: http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads.   "But I don't have time for a full blown session, I only need 2 (or just 1, or 3) C++ AMP slides to use in my session on related topic X" If all you want is a small number of slides, you can take some from the session above and customize them. But because I am so nice, I have created some slides for you, including talking points in the notes section. Download them here. Comments about this post by Daniel Moth welcome at the original blog.

    Read the article

  • SQL Azure: Notes on Building a Shard Technology

    - by Herve Roggero
    In Chapter 10 of the book on SQL Azure (http://www.apress.com/book/view/9781430229612) I am co-authoring, I am digging deeper in what it takes to write a Shard. It's actually a pretty cool exercise, and I wanted to share some thoughts on how I am designing the technology. A Shard is a technology that spreads the load of database requests over multiple databases, as transparently as possible. The type of shard I am building is called a Vertical Partition Shard  (VPS). A VPS is a mechanism by which the data is stored in one or more databases behind the scenes, but your code has no idea at design time which data is in which database. It's like having a mini cloud for records instead of services. Imagine you have three SQL Azure databases that have the same schema (DB1, DB2 and DB3), you would like to issue a SELECT * FROM Users on all three databases, concatenate the results into a single resultset, and order by last name. Imagine you want to ensure your code doesn't need to change if you add a new database to the shard (DB4). Now imagine that you want to make sure all three databases are queried at the same time, in a multi-threaded manner so your code doesn't have to wait for three database calls sequentially. Then, imagine you would like to obtain a breadcrumb (in the form of a new, virtual column) that gives you a hint as to which database a record came from, so that you could update it if needed. Now imagine all that is done through the standard SqlClient library... and you have the Shard I am currently building. Here are some lessons learned and techniques I am using with this shard: Parellel Processing: Querying databases in parallel is not too hard using the Task Parallel Library; all you need is to lock your resources when needed Deleting/Updating Data: That's not too bad either as long as you have a breadcrumb. However it becomes more difficult if you need to update a single record and you don't know in which database it is. Inserting Data: I am using a round-robin approach in which each new insert request is directed to the next database in the shard. Not sure how to deal with Bulk Loads just yet... Shard Databases:  I use a static collection of SqlConnection objects which needs to be loaded once; from there on all the Shard commands use this collection Extension Methods: In order to make it look like the Shard commands are part of the SqlClient class I use extension methods. For example I added ExecuteShardQuery and ExecuteShardNonQuery methods to SqlClient. Exceptions: Capturing exceptions in a multi-threaded code is interesting... but I kept it simple for now. I am using the ConcurrentQueue to store my exceptions. Database GUID: Every database in the shard is given a GUID, which is calculated based on the connection string's values. DataTable. The Shard methods return a DataTable object which can be bound to objects.  I will be sharing the code soon as an open-source project in CodePlex. Please stay tuned on twitter to know when it will be available (@hroggero). Or check www.bluesyntax.net for updates on the shard. Thanks!

    Read the article

  • My Red Gate Experience

    - by Colin Rothwell
    I’m Colin, and I’ve been an intern working with Mike in publishing on Simple-Talk and SQLServerCentral for the past ten weeks. I’ve mostly been working “behind the scenes”, making improvements to the spam filtering, along with various other small tweaks. When I arrived at Red Gate, one of the first things Mike asked me was what I wanted to get out of the internship. It wasn’t a question I’d given a great deal of thought to, but my immediate response was the same as almost anybody: to support my growing family. Well, ok, not quite that, but money was certainly a motivator, along with simply making sure that I didn’t get bored over the summer. Three months is a long time to fill, and many of my friends end up getting bored, or worse, knitting obsessively. With the arrogance which seems fairly common among Cambridge people, I wasn’t expecting to really learn much here! In my mind, the part of the year where I am at Uni is the part where I learn things, whilst Red Gate would be an opportunity to apply what I’d learnt. Thankfully, the opposite is true: I’ve learnt a lot during my time here, and there has been a definite positive impact on the way I write code. The first thing I’ve really learnt is that test-driven development is, in general, a sensible way of working. Before coming, I didn’t really get it: how could you test something you hadn’t yet written? It didn’t make sense! My problem was seeing a test as having to test all the behaviour of a given function. Writing tests which test the bare minimum possible and building them up is a really good way of crystallising the direction the code needs to grow in, and ensures you never attempt to write too much code at time. One really good experience of this was early on in my internship when Mike and I were working on the query used to list active authors: I’d written something which I thought would do the trick, but by starting again using TDD we grew something which revealed that there were several subtle mistakes in the query I’d written. I’ve also been awakened to the value of pair programming. Whilst I could sort of see the point before coming, I also thought that it was impossible that two people would ever get more done at the same computer than if they were working separately. I still think that this is true for projects with pieces that developers can easily work on independently, and with developers who both know the codebase, but I’ve found that pair programming can be really good for learning a code base, and for building up small projects to the point where you can start working on separate components, as well as solving particularly difficult problems. Later on in my internship, for my down tools week project, I was working on adding Python support to Glimpse. Another intern and I we pair programmed the entire project, using ping pong pair programming as much as possible. One bonus that this brought which I wasn’t expecting was that I found myself less prone to distraction: with someone else peering over my shoulder, I didn’t have the ever-present temptation to open gmail, or facebook, or yammer, or twitter, or hacker news, or reddit, and so on, and so forth. I’m quite proud of this project: I think it’s some of the best code I’ve written. I’ve also been really won over to the value of descriptive variables names. In my pre-Red Gate life, as a lone-ranger style cowboy programmer, I’d developed a tendency towards laziness in variable names, sometimes abbreviating or, worse, using acronyms. I’ve swiftly realised that this is a bad idea when working with a team: saving a few key strokes is inevitably not worth it when it comes to reading code again in the future. Longer names also mean you can do away with a majority of comments. I appreciate that if you’ve come up with an O(n*log n) algorithm for something which seemed O(n^2), you probably want to explain how it works, but explaining what a variable name means is a big no no: it’s so very easy to change the behaviour of the code, whilst forgetting about the comments. Whilst at Red Gate, I took the opportunity to attend a code retreat, which really helped me to solidify all the things I’d learnt. To be completely free of any existing code base really lets you focus on best practises and think about how you write code. If you get a chance to go on a similar event, I’d highly recommend it! Cycling to Red Gate, I’ve also become much better at fitting inner tubes: if you’re struggling to get the tube out, or re-fit the tire, letting a bit of air out usually helps. I’ve also become quite a bit better at foosball and will miss having a foosball table! I’d like to finish off by saying thank you to everyone at Red Gate for having me. I’ve really enjoyed working with, and learning from, the team that brings you this web site. If you meet any of them, buy them a drink!

    Read the article

  • The Joy Of Hex

    - by Jim Giercyk
    While working on a mainframe integration project, it occurred to me that some basic computer concepts are slipping into obscurity. For example, just about anyone can tell you that a 64-bit processor is faster than a 32-bit processer. A grade school child could tell you that a computer “speaks” in ‘1’s and ‘0’s. Some people can even tell you that there are 8 bits in a byte. However, I have found that even the most seasoned developers often can’t explain the theory behind those statements. That is not a knock on programmers; in the age of IntelliSense, what reason do we have to work with data at the bit level? Many computer theory classes treat bit-level programming as a thing of the past, no longer necessary now that storage space is plentiful. The trouble with that mindset is that the world is full of legacy systems that run programs written in the 1970’s.  Today our jobs require us to extract data from those systems, regardless of the format, and that often involves low-level programming. Because it seems knowledge of the low-level concepts is waning in recent times, I thought a review would be in order.       CHARACTER: See Spot Run HEX: 53 65 65 20 53 70 6F 74 20 52 75 6E DECIMAL: 83 101 101 32 83 112 111 116 32 82 117 110 BINARY: 01010011 01100101 01100101 00100000 01010011 01110000 01101111 01110100 00100000 01010010 01110101 01101110 In this example, I have broken down the words “See Spot Run” to a level computers can understand – machine language.     CHARACTER:  The character level is what is rendered by the computer.  A “Character Set” or “Code Page” contains 256 characters, both printable and unprintable.  Each character represents 1 BYTE of data.  For example, the character string “See Spot Run” is 12 Bytes long, exclusive of the quotation marks.  Remember, a SPACE is an unprintable character, but it still requires a byte.  In the example I have used the default Windows character set, ASCII, which you can see here:  http://www.asciitable.com/ HEX:  Hex is short for hexadecimal, or Base 16.  Humans are comfortable thinking in base ten, perhaps because they have 10 fingers and 10 toes; fingers and toes are called digits, so it’s not much of a stretch.  Computers think in Base 16, with numeric values ranging from zero to fifteen, or 0 – F.  Each decimal place has a possible 16 values as opposed to a possible 10 values in base 10.  Therefore, the number 10 in Hex is equal to the number 16 in Decimal.  DECIMAL:  The Decimal conversion is strictly for us humans to use for calculations and conversions.  It is much easier for us humans to calculate that [30 – 10 = 20] in decimal than it is for us to calculate [1E – A = 14] in Hex.  In the old days, an error in a program could be found by determining the displacement from the entry point of a module.  Since those values were dumped from the computers head, they were in hex. A programmer needed to convert them to decimal, do the equation and convert back to hex.  This gets into relative and absolute addressing, a topic for another day.  BINARY:  Binary, or machine code, is where any value can be expressed in 1s and 0s.  It is really Base 2, because each decimal place can have a possibility of only 2 characters, a 1 or a 0.  In Binary, the number 10 is equal to the number 2 in decimal. Why only 1s and 0s?  Very simply, computers are made up of lots and lots of transistors which at any given moment can be ON ( 1 ) or OFF ( 0 ).  Each transistor is a bit, and the order that the transistors fire (or not fire) is what distinguishes one value from  another in the computers head (or CPU).  Consider 32 bit vs 64 bit processing…..a 64 bit processor has the capability to read 64 transistors at a time.  A 32 bit processor can only read half as many at a time, so in theory the 64 bit processor should be much faster.  There are many more factors involved in CPU performance, but that is the fundamental difference.    DECIMAL HEX BINARY 0 0 0000 1 1 0001 2 2 0010 3 3 0011 4 4 0100 5 5 0101 6 6 0110 7 7 0111 8 8 1000 9 9 1001 10 A 1010 11 B 1011 12 C 1100 13 D 1101 14 E 1110 15 F 1111   Remember that each character is a BYTE, there are 2 HEX characters in a byte (called nibbles) and 8 BITS in a byte.  I hope you enjoyed reading about the theory of data processing.  This is just a high-level explanation, and there is much more to be learned.  It is safe to say that, no matter how advanced our programming languages and visual studios become, they are nothing more than a way to interpret bits and bytes.  There is nothing like the joy of hex to get the mind racing.

    Read the article

  • Give a session on C++ AMP – here is how

    - by Daniel Moth
    Ever since presenting on C++ AMP at the AMD Fusion conference in June, then the Gamefest conference in August, and the BUILD conference in September, I've had numerous requests about my material from folks that want to re-deliver the same session. The C++ AMP session I put together has evolved over the 3 presentations to its final form that I used at BUILD, so that is the one I recommend you base yours on. Please get the slides and the recording from channel9 (I'll refer to slide numbers below). This is how I've been presenting the C++ AMP session: Context (slide 3, 04:18-08:18) Start with a demo, on my dual-GPU machine. I've been using the N-Body sample (for VS 11 Developer Preview). (slide 4) Use an nvidia slide that has additional examples of performance improvements that customers enjoy with heterogeneous computing. (slide 5) Talk a bit about the differences today between CPU and GPU hardware, leading to the fact that these will continue to co-exist and that GPUs are great for data parallel algorithms, but not much else today. One is a jack of all trades and the other is a number cruncher. (slide 6) Use the APU example from amd, as one indication that the hardware space is still in motion, emphasizing that the C++ AMP solution is a data parallel API, not a GPU API. It has a future proof design for hardware we have yet to see. (slide 7) Provide more meta-data, as blogged about when I first introduced C++ AMP. Code (slide 9-11) Introduce C++ AMP coding with a simplistic array-addition algorithm – the slides speak for themselves. (slide 12-13) index<N>, extent<N>, and grid<N>. (Slide 14-16) array<T,N>, array_view<T,N> and comparison between them. (Slide 17) parallel_for_each. (slide 18, 21) restrict. (slide 19-20) actual restrictions of restrict(direct3d) – the slides speak for themselves. (slide 22) bring it altogether with a matrix multiplication example. (slide 23-24) accelerator, and accelerator_view. (slide 26-29) Introduce tiling incl. tiled matrix multiplication [tiling probably deserves a whole session instead of 6 minutes!]. IDE (slide 34,37) Briefly touch on the concurrency visualizer. It supports GPU profiling, but enhancements specific to C++ AMP we hope will come at the Beta timeframe, which is when I'll be spending more time talking about it. (slide 35-36, 51:54-59:16) Demonstrate the GPU debugging experience in VS 11. Summary (slide 39) Re-iterate some of the points of slide 7, and add the point that the C++ AMP spec will be open for other compiler vendors to implement, even on other platforms (in fact, Microsoft is actively working on that). (slide 40) Links to content – see slide – including where all your questions should go: http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads.   "But I don't have time for a full blown session, I only need 2 (or just 1, or 3) C++ AMP slides to use in my session on related topic X" If all you want is a small number of slides, you can take some from the session above and customize them. But because I am so nice, I have created some slides for you, including talking points in the notes section. Download them here. Comments about this post by Daniel Moth welcome at the original blog.

    Read the article

  • SQL Server Optimizer Malfunction?

    - by Tony Davis
    There was a sharp intake of breath from the audience when Adam Machanic declared the SQL Server optimizer to be essentially "stuck in 1997". It was during his fascinating "Query Tuning Mastery: Manhandling Parallelism" session at the recent PASS SQL Summit. Paraphrasing somewhat, Adam (blog | @AdamMachanic) offered a convincing argument that the optimizer often delivers flawed plans based on assumptions that are no longer valid with today’s hardware. In 1997, when Microsoft engineers re-designed the database engine for SQL Server 7.0, SQL Server got its initial implementation of a cost-based optimizer. Up to SQL Server 2000, the developer often had to deploy a steady stream of hints in SQL statements to combat the occasionally wilful plan choices made by the optimizer. However, with each successive release, the optimizer has evolved and improved in its decision-making. It is still prone to the occasional stumble when we tackle difficult problems, join large numbers of tables, perform complex aggregations, and so on, but for most of us, most of the time, the optimizer purrs along efficiently in the background. Adam, however, challenged further any assumption that the current optimizer is competent at providing the most efficient plans for our more complex analytical queries, and in particular of offering up correctly parallelized plans. He painted a picture of a present where complex analytical queries have become ever more prevalent; where disk IO is ever faster so that reads from disk come into buffer cache faster than ever; where the improving RAM-to-data ratio means that we have a better chance of finding our data in cache. Most importantly, we have more CPUs at our disposal than ever before. To get these queries to perform, we not only need to have the right indexes, but also to be able to split the data up into subsets and spread its processing evenly across all these available CPUs. Improvements such as support for ColumnStore indexes are taking things in the right direction, but, unfortunately, deficiencies in the current Optimizer mean that SQL Server is yet to be able to exploit properly all those extra CPUs. Adam’s contention was that the current optimizer uses essentially the same costing model for many of its core operations as it did back in the days of SQL Server 7, based on assumptions that are no longer valid. One example he gave was a "slow disk" bias that may have been valid back in 1997 but certainly is not on modern disk systems. Essentially, the optimizer assesses the relative cost of serial versus parallel plans based on the assumption that there is no IO cost benefit from parallelization, only CPU. It assumes that a single request will saturate the IO channel, and so a query would not run any faster if we parallelized IO because the disk system simply wouldn’t be able to handle the extra pressure. As such, the optimizer often decides that a serial plan is lower cost, often in cases where a parallel plan would improve performance dramatically. It was challenging and thought provoking stuff, as were his techniques for driving parallelism through query logic based on subsets of rows that define the "grain" of the query. I highly recommend you catch the session if you missed it. I’m interested to hear though, when and how often people feel the force of the optimizer’s shortcomings. Barring mistakes, such as stale statistics, how often do you feel the Optimizer fails to find the plan you think it should, and what are the most common causes? Is it fighting to induce it toward parallelism? Combating unexpected plans, arising from table partitioning? Something altogether more prosaic? Cheers, Tony.

    Read the article

  • About the K computer

    - by nospam(at)example.com (Joerg Moellenkamp)
    Okay ? after getting yet another mail because of the new #1 on the Top500 list, I want to add some comments from my side: Yes, the system is using SPARC processor. And that is great news for a SPARC fan like me. It is using the SPARC VIIIfx processor from Fujitsu clocked at 2 GHz. No, it isn't the only one. Most people are saying there are two in the Top500 list using SPARC (#77 JAXA and #1 K) but in fact there are three. The Tianhe-1 (#2 on the Top500 list) super computer contains 2048 Galaxy "FT-1000" 1 GHz 8-core processors. Don't know it? The FeiTeng-1000 ? this proc is a 8 core, 8 threads per core, 1 ghz processor made in China. And it's SPARC based. By the way ? this sounds really familiar to me ? perhaps the people just took the opensourced UltraSPARC-T2 design, because some of the parameters sound just to similar. However it looks like that Tianhe-1 is using the SPARCs as input nodes and not as compute notes. No, I don't see it as the next M-series processor. Simple reason: You can't create SMP systems out of them ? it simply hasn't the functionality to do so. Even when there are multiple CPUs on a single board, they are not connected like an SMP/NUMA machine to a shared memory machine ? they are connected with the cluster interconnect (in this case the Tofu interconnect) and work like a large cluster. Yes, it has a lot of oomph in Linpack ? however I assume a lot came from the extensions to the SPARCv9 standard. No, Linpack has no relevance for any commercial workload ? Linpack is such a special load, that even some HPC people are arguing that it isn't really a good benchmark for HPC. It's embarrassingly parallel, it can work with relatively small interconnects compared to the interconnects in SMP systems (however we get in spheres SMP interconnects where a few years ago). Amdahl isn't hitting that hard when running Linpack. Yes, it's a good move to use SPARC. At some time in the last 10 years, there was an interesting twist in perception: SPARC was considered as proprietary architecture and x86 was the open architecture. However it's vice versa ? try to create a x86 clone and you have a lot of intellectual property problems, create a SPARC clone and you have to spend 100 bucks or so to get the specification from the SPARC Foundation and develop your own SPARC processor. Fujitsu is doing this for a long time now. So they had their own processor, their own know-how. So why was SPARC a good choice? Well ? essentially Fujitsu can do what they want with their core as it is their core, for example adding the extensions to the SPARCv9 chipset ? getting Intel to create extensions to x86 to help you with your product is a little bit harder. So Fujitsu could do they needed to do with their processor in order to create such a supercomputer. No, the K is really using no FPGA or GPU as accelerators. The K is really using the CPU at doing this job. Yes, it has a significantly enhanced FPU capable to execute 8 instructions in parallel. No, it doesn't run Solaris. Yes, it uses Linux. No, it doesn't hurt me ... as my colleague Roland Rambau (he knows a lot about HPC) said once to me ... it doesn't matter which OS is staying out of the way of the workload in HPC.

    Read the article

  • Faster Memory Allocation Using vmtasks

    - by Steve Sistare
    You may have noticed a new system process called "vmtasks" on Solaris 11 systems: % pgrep vmtasks 8 % prstat -p 8 PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP 8 root 0K 0K sleep 99 -20 9:10:59 0.0% vmtasks/32 What is vmtasks, and why should you care? In a nutshell, vmtasks accelerates creation, locking, and destruction of pages in shared memory segments. This is particularly helpful for locked memory, as creating a page of physical memory is much more expensive than creating a page of virtual memory. For example, an ISM segment (shmflag & SHM_SHARE_MMU) is locked in memory on the first shmat() call, and a DISM segment (shmflg & SHM_PAGEABLE) is locked using mlock() or memcntl(). Segment operations such as creation and locking are typically single threaded, performed by the thread making the system call. In many applications, the size of a shared memory segment is a large fraction of total physical memory, and the single-threaded initialization is a scalability bottleneck which increases application startup time. To break the bottleneck, we apply parallel processing, harnessing the power of the additional CPUs that are always present on modern platforms. For sufficiently large segments, as many of 16 threads of vmtasks are employed to assist an application thread during creation, locking, and destruction operations. The segment is implicitly divided at page boundaries, and each thread is given a chunk of pages to process. The per-page processing time can vary, so for dynamic load balancing, the number of chunks is greater than the number of threads, and threads grab chunks dynamically as they finish their work. Because the threads modify a single application address space in compressed time interval, contention on locks protecting VM data structures locks was a problem, and we had to re-scale a number of VM locks to get good parallel efficiency. The vmtasks process has 1 thread per CPU and may accelerate multiple segment operations simultaneously, but each operation gets at most 16 helper threads to avoid monopolizing CPU resources. We may reconsider this limit in the future. Acceleration using vmtasks is enabled out of the box, with no tuning required, and works for all Solaris platform architectures (SPARC sun4u, SPARC sun4v, x86). The following tables show the time to create + lock + destroy a large segment, normalized as milliseconds per gigabyte, before and after the introduction of vmtasks: ISM system ncpu before after speedup ------ ---- ------ ----- ------- x4600 32 1386 245 6X X7560 64 1016 153 7X M9000 512 1196 206 6X T5240 128 2506 234 11X T4-2 128 1197 107 11x DISM system ncpu before after speedup ------ ---- ------ ----- ------- x4600 32 1582 265 6X X7560 64 1116 158 7X M9000 512 1165 152 8X T5240 128 2796 198 14X (I am missing the data for T4 DISM, for no good reason; it works fine). The following table separates the creation and destruction times: ISM, T4-2 before after ------ ----- create 702 64 destroy 495 43 To put this in perspective, consider creating a 512 GB ISM segment on T4-2. Creating the segment would take 6 minutes with the old code, and only 33 seconds with the new. If this is your Oracle SGA, you save over 5 minutes when starting the database, and you also save when shutting it down prior to a restart. Those minutes go directly to your bottom line for service availability.

    Read the article

  • StreamInsight 2.1 Released

    - by Roman Schindlauer
    The wait is over—we are pleased to announce the release of StreamInsight 2.1. Since the release of version 1.2, we have heard your feedbacks and suggestions and based on that we have come up with a whole new set of features. Here are some of the highlights: A New Programming Model – A more clear and consistent object model, eliminating the need for complex input and output adapters (though they are still completely supported). This new model allows you to provision, name, and manage data sources and sinks in the StreamInsight server. Tight integration with Reactive Framework (Rx) – You can write reactive queries hosted inside StreamInsight as well as compose temporal queries on reactive objects. High Availability – Check-pointing over temporal streams and multiple processes with shared computation. Here is how simple coding can be with the 2.1 Programming Model: class Program {     static void Main(string[] args)     {         using (Server server = Server.Create("Default"))         {             // Create an app             Application app = server.CreateApplication("app");             // Define a simple observable which generates an integer every second             var source = app.DefineObservable(() =>                 Observable.Interval(TimeSpan.FromSeconds(1)));             // Define a sink.             var sink = app.DefineObserver(() =>                 Observer.Create<long>(x => Console.WriteLine(x)));             // Define a query to filter the events             var query = from e in source                         where e % 2 == 0                         select e;             // Bind the query to the sink and create a runnable process             using (IDisposable proc = query.Bind(sink).Run("MyProcess"))             {                 Console.WriteLine("Press a key to dispose the process...");                 Console.ReadKey();             }         }     } }   That’s how easily you can define a source, sink and compose a query and run it. Note that we did not replace the existing APIs, they co-exist with the new surface. Stay tuned, you will see a series of articles coming out over the next few weeks about the new features and how to use them. Come and grab it from our download center page and let us know what you think! You can find the updated MSDN documentation here, and we would appreciate if you could provide feedback to the docs as well—best via email to [email protected]. Moreover, we updated our samples to demonstrate the new programming surface. Regards, The StreamInsight Team

    Read the article

  • StreamInsight 2.1 Released

    - by Roman Schindlauer
    The wait is over—we are pleased to announce the release of StreamInsight 2.1. Since the release of version 1.2, we have heard your feedbacks and suggestions and based on that we have come up with a whole new set of features. Here are some of the highlights: A New Programming Model – A more clear and consistent object model, eliminating the need for complex input and output adapters (though they are still completely supported). This new model allows you to provision, name, and manage data sources and sinks in the StreamInsight server. Tight integration with Reactive Framework (Rx) – You can write reactive queries hosted inside StreamInsight as well as compose temporal queries on reactive objects. High Availability – Check-pointing over temporal streams and multiple processes with shared computation. Here is how simple coding can be with the 2.1 Programming Model: class Program {     static void Main(string[] args)     {         using (Server server = Server.Create("Default"))         {             // Create an app             Application app = server.CreateApplication("app");             // Define a simple observable which generates an integer every second             var source = app.DefineObservable(() =>                 Observable.Interval(TimeSpan.FromSeconds(1)));             // Define a sink.             var sink = app.DefineObserver(() =>                 Observer.Create<long>(x => Console.WriteLine(x)));             // Define a query to filter the events             var query = from e in source                         where e % 2 == 0                         select e;             // Bind the query to the sink and create a runnable process             using (IDisposable proc = query.Bind(sink).Run("MyProcess"))             {                 Console.WriteLine("Press a key to dispose the process...");                 Console.ReadKey();             }         }     } }   That’s how easily you can define a source, sink and compose a query and run it. Note that we did not replace the existing APIs, they co-exist with the new surface. Stay tuned, you will see a series of articles coming out over the next few weeks about the new features and how to use them. Come and grab it from our download center page and let us know what you think! You can find the updated MSDN documentation here, and we would appreciate if you could provide feedback to the docs as well—best via email to [email protected]. Moreover, we updated our samples to demonstrate the new programming surface. Regards, The StreamInsight Team

    Read the article

  • Is Rails Metal (& Rack) a good way to implement a high traffic web service api?

    - by Greg
    I am working on a very typical web application. The main component of the user experience is a widget that a site owner would install on their front page. Every time their front page loads, the widget talks to our server and displays some of the data that returns. So there are two components to this web application: the front end UI that the site owner uses to configure their widget the back end component that responds to the widget's web api call Previously we had all of this running in PHP. Now we are experimenting with Rails, which is fantastic for #1 (the front end UI). The question is how to do #2, the back serving of widget information, efficiently. Obviously this is much higher load than the front end, since it is called every time the front page loads on one of our clients' websites. I can see two obvious approaches: A. Parallel Stack: Set up a parallel stack that uses something other than rails (e.g. our old PHP-based approach) but accesses the same database as the front end B. Rails Metal: Use Rails Metal/Rack to bypass the Rails routing mechanism, but keep the api call responder within the Rails app My main question: Is Rails/Metal a reasonable approach for something like this? But also... Will the overhead of loading the Rails environment still be too heavy? Is there a way to get even closer to the metal with Rails, bypassing most of the environment? Will Rails/Metal performance approach the perf of a similar task on straight PHP (just looking for ballpark here)? And... Is there a 'C' option that would be much better than both A and B? That is, something before going to the lengths of C code compiled to binary and installed as an nginx or apache module? Thanks in advance for any insights.

    Read the article

  • solve a classic map-reduce problem with opencl?

    - by liuliu
    I am trying to parallel a classic map-reduce problem (which can parallel well with MPI) with OpenCL, namely, the AMD implementation. But the result bothers me. Let me brief about the problem first. There are two type of data that flow into the system: the feature set (30 parameters for each) and the sample set (9000+ dimensions for each). It is a classic map-reduce problem in the sense that I need to calculate the score of every feature on every sample (Map). And then, sum up the overall score for every feature (Reduce). There are around 10k features and 30k samples. I tried different ways to solve the problem. First, I tried to decompose the problem by features. The problem is that the score calculation consists of random memory access (pick some of the 9000+ dimensions and do plus/subtraction calculations). Since I cannot coalesce memory access, it costs. Then, I tried to decompose the problem by samples. The problem is that to sum up overall score, all threads are competing for few score variables. It keeps overwriting the score which turns out to be incorrect. (I cannot carry out individual score first and sum up later because it requires 10k * 30k * 4 bytes). The first method I tried gives me the same performance on i7 860 CPU with 8 threads. However, I don't think the problem is unsolvable: it is remarkably similar to ray tracing problem (for which you carry out calculation that millions of rays against millions of triangles). Any ideas?

    Read the article

  • animation extender in datalist control in asp.net 2008

    - by BibiBuBu
    Good Day! i have a question that how can i use animation extender in datalist control in asp.net with c#. i want the animation when i click the delete button (delete button will be in repeater). so that when i remove one record then it shows animation to bring the next record. it is in update panel. <cc1:AnimationExtender ID="AnimationExtender1" runat="server" Enabled="True" TargetControlID="btnDeleteId"> <Animations> <OnClick> <Sequence> <EnableAction Enabled="false" /> <Parallel Duration=".2"> <Resize Height="0" Width="0" Unit="px" /> <FadeOut /> </Parallel> <HideAction /> </Sequence> </OnClick> </Animations> </cc1:AnimationExtender> now if i put my button id in the Target control id then it gives error that it should not be in same update panel etc... but over all nothing working for animation. i am binding my datalist in itemDataBound....e.g. ImageButton imgbtn = (ImageButton)e.Item.FindControl("imgBtnPic"); Label lblAvatar = (Label)e.Item.FindControl("lblAvatar"); LinkButton lbName = (LinkButton)e.Item.FindControl("lbtnName"); Can somebody please suggest me something. thanks

    Read the article

  • Spatial Rotation in Gmod Expression2.

    - by Fascia
    I'm using expression2 to program behavior in Garry's mod (http://wiki.garrysmod.com/?title=Wire_Expression2) Okay so, to set the precedent. In Gmod I have a block and I am at a complete loss of how to get it to rotate around the 3 up, down and right vectors (Which are local. ie; if I pitch it 45 degrees the forward vector is 0.707, 0.707, 0). Essentially, From the 3 vectors I'd like to be able to get local Pitch/Roll/Yaw. By Local Pitch Roll Yaw I mean that they are completely independent of one another allowing true 3d rotation. So for example; if I place my craft so its nose is parallel to the floor the X,Y,Z would be 0,0,0. If I turn it parallel to the floor (World and Local Yaw) 90 degrees it's now 0, 0, 90. If I then pitch it (World Roll, Local Pitch) it 180 degrees it's now 180, 0, 90. I've already explored quaternions however I don't believe I should post my code here as I think I was re-inventing the wheel. I know I didn't explain that well but I believe the problem is pretty generic. Any help anyone could offer is greatly appreciated. Oh, I'd like to avoid gimblelock too. Essentially calculating the rotation around each of the crafts up/forward/right vectors using the up/forward/right vectors. To simply the question a generic implementation rather than one specific to Gmod is absolutely fine.

    Read the article

  • Getting browser to make an AJAX call ASAP, while page is still loading

    - by Chris
    I'm looking for tips on how to get the browser to kick off an AJAX call as soon as possible into processing a web page, ideally before the page is fully downloaded. Here's my approximate motivation: I have a web page that takes, say, 5 seconds to load. It calls a web service that takes, say, 10 seconds to load. If loading the page and calling the web service happened sequentially, the user would have to wait 15 seconds to see all the information. However, if I can get the web service call started before the 5 second page load is complete, then at least some of the work can happened in parallel. Ideally I'd like as much of the work to happen in parallel as possible. My initial theory was that I should place the AJAX-calling javascript as high up as possible in the web page HTML source (being mindful of putting it after the jquery.js include, because I'm making the call using jquery ajax). I'm also being mindful not to wrap the AJAX call in a jquery ready event handler. (I mention this because ready events are popular in a lot of jquery example code.) However, the AJAX call still doesn't seem to get kicked off as early as I'm hoping (at least as judged by the Google Chrome "Timeline" feature), so I'm wondering what other considerations apply. One thing that might potentially be detrimental is that the AJAX call is back to the same web server that's serving the original web page, so I might be in danger of hitting a browser limit on the # of HTTP connections back to that one machine? (The HTML page loads a number of images, css files, etc..)

    Read the article

  • P6 Architecture - Register renaming aside, does the limited user registers result in more ops spent

    - by mrjoltcola
    I'm studying JIT design with regard to dynamic languages VM implementation. I haven't done much Assembly since the 8086/8088 days, just a little here or there, so be nice if I'm out of sorts. As I understand it, the x86 (IA-32) architecture still has the same basic limited register set today that it always did, but the internal register count has grown tremendously, but these internal registers are not generally available and are used with register renaming to achieve parallel pipelining of code that otherwise could not be parallelizable. I understand this optimization pretty well, but my feeling is, while these optimizations help in overall throughput and for parallel algorithms, the limited register set we are still stuck with results in more register spilling overhead such that if x86 had double, or quadruple the registers available to us, there may be significantly less push/pop opcodes in a typical instruction stream? Or are there other processor optmizations that also optimize this away that I am unaware of? Basically if I've a unit of code that has 4 registers to work with for integer work, but my unit has a dozen variables, I've got potentially a push/pop for every 2 or so instructions. Any references to studies, or better yet, personal experiences?

    Read the article

  • How can I improve my real-time behavior in multi-threaded app using pthreads and condition variables

    - by WilliamKF
    I have a multi-threaded application that is using pthreads. I have a mutex() lock and condition variables(). There are two threads, one thread is producing data for the second thread, a worker, which is trying to process the produced data in a real time fashion such that one chuck is processed as close to the elapsing of a fixed time period as possible. This works pretty well, however, occasionally when the producer thread releases the condition upon which the worker is waiting, a delay of up to almost a whole second is seen before the worker thread gets control and executes again. I know this because right before the producer releases the condition upon which the worker is waiting, it does a chuck of processing for the worker if it is time to process another chuck, then immediately upon receiving the condition in the worker thread, it also does a chuck of processing if it is time to process another chuck. In this later case, I am seeing that I am late processing the chuck many times. I'd like to eliminate this lost efficiency and do what I can to keep the chucks ticking away as close to possible to the desired frequency. Is there anything I can do to reduce the delay between the release condition from the producer and the detection that that condition is released such that the worker resumes processing? For example, would it help for the producer to call something to force itself to be context switched out? Bottom line is the worker has to wait each time it asks the producer to create work for itself so that the producer can muck with the worker's data structures before telling the worker it is ready to run in parallel again. This period of exclusive access by the producer is meant to be short, but during this period, I am also checking for real-time work to be done by the producer on behalf of the worker while the producer has exclusive access. Somehow my hand off back to running in parallel again results in significant delay occasionally that I would like to avoid. Please suggest how this might be best accomplished.

    Read the article

< Previous Page | 273 274 275 276 277 278 279 280 281 282 283 284  | Next Page >