Search Results

Search found 1731 results on 70 pages for 'daniel rose'.

Page 3/70 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

Responding to Invites

- by Daniel Moth

Following up from my post about Sending Outlook Invites here is a shorter one on how to respond. Whatever your choice (ACCEPT, TENTATIVE, DECLINE), if the sender has not unchecked the "Request Response" option, then send your response. Always send your response. Even if you think the sender made a mistake in keeping it on, send your response. Seriously, not responding is plain rude. If you knew about the meeting, and you are happy investing your time in it, and the time and location work for you, and there is an implicit/explicit agenda, then ACCEPT and send it. If one or more of those things don't work for you then you have a few options. Send a DECLINE explaining why. Reply with email to ask for further details or for a change to be made. If you don’t receive a response to your email, send a DECLINE when you've waited enough. Send a TENTATIVE if you haven't made up your mind yet. Hint: if they really require you there, they'll respond asking "why tentative" and you have a discussion about it. When you deem appropriate, instead of the options above, you can also use the counter propose feature of Outlook but IMO that feature has questionable interaction model and UI (on both sender and recipient) so many people get confused by it. BTW, two of my outlook rules are relevant to invites. The first one auto-marks as read the ACCEPT responses if there is no comment in the body of the accept (I check later who has accepted and who hasn't via the "Tracking" button of the invite). I don’t have a rule for the DECLINE and TENTATIVE cause typically I follow up with folks that send those. The second rule ensures that all Invites go to a specific folder. That is the first folder I see when I triage email. It is also the only folder which I have configured to show a count of all items inside it, rather than the unread count - when sending a response to an invite the item disappears from the folder and hence it is empty and not nagging me. Comments about this post by Daniel Moth welcome at the original blog.

Read the article
Live Debugging

- by Daniel Moth

Based on my classification of diagnostics, you should know what live debugging is NOT about - at least according to me :-) and in this post I'll share how I think of live debugging. These are the (outer) steps to live debugging Get the debugger in the picture. Control program execution. Inspect state. Iterate between 2 and 3 as necessary. Stop debugging (and potentially start new iteration going back to step 1). Step 1 has two options: start with the debugger attached, or execute your binary separately and attach the debugger later. You might say there is a 3rd option, where the app notifies you that there is an issue, referred to as JIT debugging. However, that is just a variation of the attach because that is when you start the debugging session: when you attach. I'll be covering in future posts how this step works in Visual Studio. Step 2 is about pausing (or breaking) your app so that it makes no progress and remains "frozen". A sub-variation is to pause only parts of its execution, or in other words to freeze individual threads. I'll be covering in future posts the various ways you can perform this step in Visual Studio. Step 3, is about seeing what the state of your program is when you have paused it. Typically it involves comparing the state you are finding, with a mental picture of what you thought the state would be. Or simply checking invariants about the intended state of the app, with the actual state of the app. I'll be covering in future posts the various ways you can perform this step in Visual Studio. Step 4 is necessary if you need to inspect more state - rinse and repeat. Self-explanatory, and will be covered as part of steps 2 & 3. Step 5 is the most straightforward, with 3 options: Detach the debugger; terminate your binary though the normal way that it terminates (e.g. close the main window); and, terminate the debugging session through your debugger with a result that it terminates the execution of your program too. In a future post I'll cover the ways you can detach or terminate the debugger in Visual Studio. I found an old picture I used to use to map the steps above on Visual Studio 2010. It is basically the Debug menu with colored rectangles around each menu mapping the menu to one of the first 3 steps (step 5 was merged with step 1 for that slide). Here it is in case it helps: Stay tuned for more... Comments about this post by Daniel Moth welcome at the original blog.

Read the article
Attach to Process in Visual Studio

- by Daniel Moth

One option for achieving step 1 in the Live Debugging process is attaching to an already running instance of the process that hosts your code, and this is a good place for me to talk about debug engines. You can attach to a process by selecting the "Debug" menu and then the "Attach To Process…" menu in Visual Studio 11 (Ctrl+Alt+P with my keyboard bindings), and you should see something like this screenshot: I am not going to explain this UI, besides being fairly intuitive, there is good documentation on MSDN for the Attach dialog. I do want to focus on the row of controls that starts with the "Attach to:" label and ends with the "Select..." button. Between them is the readonly textbox that indicates the debug engine that will be used for the selected process if you click the "Attach" button. If you haven't encountered that term before, read on MSDN about debug engines. Notice that the "Type" column shows the Code Type(s) that can be detected for the process. Typically each debug engine knows how to debug a specific code type (the two terms tend to be used interchangeably). If you click on a different process in the list with a different code type, the debug engine used will be different. However note that this is the automatic behavior. If you believe you know best, or more typically you want to choose the debug engine for a process using more than one code type, you can do so by clicking the "Select..." button, which should yield a "Select Code Type" dialog like this one: In this dialog you can switch to the debug engine you want to use by checking the box in front of your desired one, then hit "OK", then hit "Attach" to use it. Notice that the dialog suggests that you can select more than one. Not all combinations work (you'll get an error if you select two incompatible debug engines), but some do. Also notice in the list of debug engines one of the new players in Visual Studio 11, the GPU debug engine - I will be covering that on the C++ AMP team blog (and no, it cannot be combined with any others in this release). Comments about this post by Daniel Moth welcome at the original blog.

Read the article
Running C++ AMP kernels on the CPU

- by Daniel Moth

One of the FAQs we receive is whether C++ AMP can be used to target the CPU. For targeting multi-core we have a technology we released with VS2010 called PPL, which has had enhancements for VS 11 – that is what you should be using! FYI, it also has a Linux implementation via Intel's TBB which conforms to the same interface. When you choose to use C++ AMP, you choose to take advantage of massively parallel hardware, through accelerators like the GPU. Having said that, you can always use the accelerator class to check if you are running on a system where the is no hardware with a DirectX 11 driver, and decide what alternative code path you wish to follow. In fact, if you do nothing in code, if the runtime does not find DX11 hardware to run your code on, it will choose the WARP accelerator which will run your code on the CPU, taking advantage of multi-core and SSE2 (depending on the CPU capabilities WARP also uses SSE3 and SSE 4.1 – it does not currently use AVX and on such systems you hopefully have a DX 11 GPU anyway). A few things to know about WARP It is our fallback CPU solution, not intended as a primary target of C++ AMP. WARP stands for Windows Advanced Rasterization Platform and you can read old info on this MSDN page on WARP. What is new in Windows 8 Developer Preview is that WARP now supports DirectCompute, which is what C++ AMP builds on. It is not currently clear if we will have a CPU fallback solution for non-Windows 8 platforms when we ship. When you create a WARP accelerator, its is_emulated property returns true. WARP does not currently support double precision. BTW, when we refer to WARP, we refer to this accelerator described above. If we use lower case "warp", that refers to a bunch of threads that run concurrently in lock step and share the same instruction. In the VS 11 Developer Preview, the size of warp in our Ref emulator is 4 – Ref is another emulator that runs on the CPU, but it is extremely slow not intended for production, just for debugging. Comments about this post by Daniel Moth welcome at the original blog.

Read the article
OOF checklist

- by Daniel Moth

When going on vacation or otherwise being out of office (known as OOF in Microsoft), it is polite and professional that our absence creates the minimum disruption possible to the rest of the business, and especially our colleagues. Below is my OOF checklist - I try to do these as soon as I know I'll be OOF, rather than leave it for the night before. Let the relevant folks on the team know the planned dates of absence and check if anybody was expecting something from you during that timeframe. Reset expectations with them, and as applicable try to find another owner for individual activities that cannot wait. Go through your calendar for the OOF period and decline every meeting occurrence so the owner of the meeting knows that you won't be attending (similar to my post about responding to invites). If it is your meeting cancel it so that people don’t turn up without the meeting organizer being there. Do this even for meetings were the folks should know due to step #1. Over-communicating is a good thing here and keeps calendars all around up to date. Enter your OOF dates in whatever tool your company uses. Typically that is the notification to your manager. In your Outlook calendar, create a local Appointment (don't invite anyone) for the date range (All day event) setting the "Show As" dropdown to "Out of Office". This way, people won’t try to schedule meetings with you on that day. If you use Lync, set the status to "Off Work" for that period. If you won't be responding to email (which when on your vacation you definitely shouldn't) then in Outlook setup "Automatic Replies (Out of Office)" for that period. This way people won’t think you are rude when not replying to their emails. In your OOF message point to an alias (ideally of many people) as a fallback for urgent queries. If you want to proactively notify individuals of your OOFage then schedule and send a multi-day meeting request for the entire period. Remember to set the "Show As" to "Free" (so their calendar doesn’t show busy/oof to others), set the "Reminder" to "None" (so they don’t get a reminder about it), set "Low Importance", and uncheck both "Response Options" so if they don't want this on their calendar, it is just one click for them to get rid of it. Aside: I have another post with advice on sending invites. If you care about people who would not observe the above but could drop by your office, stick a physical OOF note at your office door or chair/monitor or desk. Have I missed any? Comments about this post by Daniel Moth welcome at the original blog.

Read the article
Perfect is the enemy of “Good Enough”

- by Daniel Moth

This is one of the quotes that I was against, but now it is totally part of my core beliefs: "Perfect is the enemy of Good Enough" Folks used to share this quote a lot with me in my early career and my frequent interpretation was that they were incompetent people that were satisfied with mediocrity, i.e. I ignored them and their advice. (Yes, I went through an arrogance phase). I later "grew up" and "realized" that they were missing the point, so instead of ignoring them I would retort: "Of course we have to aim for perfection, because as human beings we'll never achieve perfection, so by aiming for perfection we will indeed achieve good enough results". (Yes, I went through a smart ass phase). Later I grew up a bit more and "understood" that what I was really being told is to finish my work earlier and move on to other things because by trying to perfect that one thing, another N things that I was responsible for were suffering by not getting my attention - all things on my plate need to move beyond the line, not just one of them to go way over the line. It is really a statement of increasing scale and scope. To put it in other words, getting PASS grades on 10 things is better than getting an A+ with distinction on 1-2 and a FAIL on the rest. Instead of saying “I am able to do very well these X items” it is best if you can say I can do well enough on these X * Y items”, where Y > 1. That is how breadth impact is achieved. In the future, I may grow up again and have a different interpretation, but for now - even though I secretly try to "perfect" things, I try not to do that at the expense of other responsibilities. This means that I haven't had anybody quote that saying to me in a while (or perhaps my quality of work has dropped so much that it doesn't apply to me any more - who knows :-)). Wikipedia attributes the quote to Voltaire and it also makes connections to the “Law of diminishing returns”, and to the “80-20 rule” or “Pareto principle”… it commonly takes 20% of the full time to complete 80% of a task while to complete the last 20% of a task takes 80% of the effort …check out the Wikipedia entry on “Perfect is the enemy of Good” and its links. Also use your favorite search engine to search and see what others are saying (Bing, Google) – it is worth internalizing this in a way that makes sense to you… Comments about this post by Daniel Moth welcome at the original blog.

Read the article
Link instead of Attaching

- by Daniel Moth

With email storage not being an issue in many companies (I think I currently have 25GB of storage on my email account, I don’t even think about storage), this encourages bad behaviors such as liberally attaching office documents to emails instead of sharing a link to the document in SharePoint or SkyDrive or some file share etc. Attaching a file admittedly has its usage scenarios too, but it should not be the default. I thought I'd list the reasons why sharing a link can be better than attaching files directly. In no particular order: Better Review. It allows multiple recipients to review the file and their comments are aggregated into a single document. The alternative is everyone having to detach the document, add their comments, then send back to you, and then you have to collate. Wirth the alternative, you also potentially miss out on recipients reading comments from other recipients. Always up to date. The attachment becomes a fork instead of an always up to date document. For example, you send the email on Thursday, I only open it on Tuesday: between those days you could have made updates that now I am missing because you decided to share a link instead of an attachment. Better bookmarking. When I need to find that document you shared, you are forcing me to search through my email (I may not even be running outlook), instead of opening the link which I have bookmarked in my browser or my collection of links in my OneNote or from the recent/pinned links of the office app on my task bar, etc. Can control access. If someone accidentally or naively forwards your link to someone outside your group/org who you’d prefer not to have access to it, the location of the document can be protected with specific access control. Can add more recipients. If someone adds people to the email thread in outlook, your attachment doesn't get re-attached - instead, the person added is left without the attachment unless someone remembers to re-attach it. If it was a link, they are immediately caught up without further actions. Enable Discovery. If you put it on a share, I may be able to discover other cool stuff that lives alongside that document. Save on storage. So this doesn't apply to me given my opening statement, but if in your company you do have such limitations, attaching files eats up storage on all recipients accounts and will also get "lost" when those people archive email (and lose completely at some point if they follow the company retention policy). Like I said, attachments do have their place, but they should be an explicit choice for explicit reasons rather than the default. Comments about this post by Daniel Moth welcome at the original blog.

Read the article
GPU Debugging with VS 11

- by Daniel Moth

With VS 11 Developer Preview we have invested tremendously in parallel debugging for both CPU (managed and native) and GPU debugging. I'll be doing a whole bunch of blog posts on those topics, and in this post I just wanted to get people started with GPU debugging, i.e. with debugging C++ AMP code. First I invite you to watch 6 minutes of a glimpse of the C++ AMP debugging experience though this video (ffw to minute 51:54, up until minute 59:16). Don't read the rest of this post, just go watch that video, ideally download the High Quality WMV. Summary GPU debugging essentially means debugging the lambda that you pass to the parallel_for_each call (plus any functions you call from the lambda, of course). CPU debugging means debugging all the code above and below the parallel_for_each call, i.e. all the code except the restrict(direct3d) lambda and the functions that it calls. With VS 11 you have to choose what debugger you want to use for a particular debugging session, CPU or GPU. So you can place breakpoints all over your code, then choose what debugger you want (CPU or GPU), and you'll only be able to hit breakpoints for the code type that the debugger engine understands – the remaining breakpoints will appear as unbound. If you want to hit the unbound breakpoints, you'd have to stop debugging, and start again with the other debugger. Sorry. We suck. We know. But once you are past that limitation, I think you'll find the experience truly rewarding – seriously! Switching debugger engines With the Developer Preview bits, one way to switch the debugger engine is through the project properties – see the screenshots that follow. This one is showing the CPU option selected, which is basically the default that you are all familiar with: This screenshot is showing the GPU option selected, by changing the debugger launcher (notice that this applies for both the local and remote case): You actually do not have to open the project properties just for switching the debugger engine, you can switch the selection from the toolbar in VS 11 Developer Preview too – see following screenshot (the effect is the same as if you opened the project properties and switched there) Breakpoint behavior Here are two screenshots, one showing a debugging session for CPU and the other a debugging session for GPU (notice the unbound breakpoints in each case) …and here is the GPU case (where we cannot bind the CPU breakpoints but can the GPU breakpoint, which is actually hit) Give C++ AMP debugging a try So to debug your C++ AMP code, pull down the drop down under the 'play' button to select the 'GPU C++ Direct3D Compute Debugger' menu option, then hit F5 (or the 'play' button itself). Then you can explore debugging by exploring the menus under the Debug and under the Debug->Windows menus. One way to do that exploration is through the C++ AMP debugging walkthrough on MSDN. Another way to explore the C++ AMP debugging experience, you can use the moth.cpp code file, which is what I used in my BUILD session debugger demo. Note that for my demo I was using the latest internal VS11 bits, so your experience with the Developer Preview bits won't be identical to what you saw me demonstrate, but it shouldn't be far off. Stay tuned for a lot more content on the parallel debugger in VS 11, both CPU and GPU, both managed and native. Comments about this post by Daniel Moth welcome at the original blog.

Read the article
asynchrony is viral

- by Daniel Moth

It is becoming hard to write code today without introducing some form of asynchrony and, if you are using .NET (e.g. for Windows Phone 8 or Windows Store apps), that means sooner or later you have to await something and mark your method as async. My most recent examples included introducing speech recognition in my Translator By Moth phone app where I had to await mySpeechRecognizerUI.RecognizeWithUIAsync() and when moving that code base to a Windows Store project just to show a MessageBox I had to await myMessageDialog.ShowAsync(). Any time you need to invoke an asynchronous method in your code, you have a choice to make: kick off the operation but don’t wait for it to complete (otherwise known as fire-and-forget), synchronously wait for it to complete (which will entail blocking, which can be bad, especially on a UI thread), or asynchronously wait for it to complete before continuing on with the rest of the method’s work. In most cases, you want the latter, and the await keyword makes that trivial to implement. When you use the magical await keyword in front of an API call, then you typically have to make additional changes to your code: This await usage is within a method of course, and now you have to annotate that method with async. Furthermore, you have to change the return type of the method you just annotated so it returns a Task (if it previously returned void), or Task<myOldReturnType> (if it previously returned myOldReturnType). Note that if it returns void, in some cases you could cheat and stop there. Furthermore, any method that called this method you just annotated with async will now also be invoking an asynchronous operation, so you have to make that change in the body of the caller method to introduce the await keyword before the call to the method. …you guessed it, you now have to change this caller method to be annotated with async and have its return types tweaked... …and it goes on virally… At some point you reach the root of your user code, e.g. a GUI event handler, and whoever calls that void method can already deal with the fact that you marked it as async and the viral introduction of the keywords stops there… This is all wonderful progress and a very powerful mechanism, and I just wish someone had written a refactoring tool to take care of this… anyone? I mentioned earlier that you have a choice when invoking an asynchronous operation. If the first time you encounter this you wish to localize the impact of all these changes and essentially try to turn the asynchronous behavior into synchronous by blocking - don't! For reasons why you don't want to do that, read Toub's excellent blog post (and check out the rest of his blog with gems on async programming starting with the Async FAQ). Just embrace the pattern knowing that when you use one instance of an await, you'll propagate the change all the way to the root user code method, e.g. typically an event handler. Related aside: I just finished re-writing my MessageBox wrapper class for Phone projects, including making it work in Windows Store projects, and it does expect you to use it with an await :-). I'll share that in an upcoming post for those of you that have the same need… Comments about this post by Daniel Moth welcome at the original blog.

Read the article
concurrency::accelerator_view

- by Daniel Moth

Overview We saw previously that accelerator represents a target for our C++ AMP computation or memory allocation and that there is a notion of a default accelerator. We ended that post by introducing how one can obtain accelerator_view objects from an accelerator object through the accelerator class's default_view property and the create_view method. The accelerator_view objects can be thought of as handles to an accelerator. You can also construct an accelerator_view given another accelerator_view (through the copy constructor or the assignment operator overload). Speaking of operator overloading, you can also compare (for equality and inequality) two accelerator_view objects between them to determine if they refer to the same underlying accelerator. We'll see later that when we use concurrency::array objects, the allocation of data takes place on an accelerator at array construction time, so there is a constructor overload that accepts an accelerator_view object. We'll also see later that a new concurrency::parallel_for_each function overload can take an accelerator_view object, so it knows on what target to execute the computation (represented by a lambda that the parallel_for_each also accepts). Beyond normal usage, accelerator_view is a quality of service concept that offers isolation to multiple "consumers" of an accelerator. If in your code you are accessing the accelerator from multiple threads (or, in general, from different parts of your app), then you'll want to create separate accelerator_view objects for each thread. flush, wait, and queuing_mode When you create an accelerator_view via the create_view method of the accelerator, you pass in an option of immediate or deferred, which are the two members of the queuing_mode enum. At any point you can access this value from the queuing_mode property of the accelerator_view. When the queuing_mode value is immediate (which is the default), any commands sent to the device such as kernel invocations and data transfers (e.g. parallel_for_each and copy, as we'll see in future posts), will get submitted as soon as the runtime sees fit (that is the definition of immediate). When the value of queuing_mode is deferred, the commands will be batched up. To send all buffered commands to the device for execution, there is a non-blocking flush method that you can call. If you wish to block until all the commands have been sent, there is a wait method you can call. Deferring is a more advanced scenario aimed at performance gains when you are submitting many device commands and you want to avoid the tiny overhead of flushing/submitting each command separately. Querying information Just like accelerator, accelerator_view exposes the is_debug and version properties. In fact, you can always access the accelerator object from the accelerator property on the accelerator_view class to access the accelerator interface we looked at previously. Interop with D3D (aka DX) In a later post I'll show an example of an app that uses C++ AMP to compute data that is used in pixel shaders. In those scenarios, you can benefit by integrating C++ AMP into your graphics pipeline and one of the building blocks for that is being able to use the same device context from both the compute kernel and the other shaders. You can do that by going from accelerator_view to device context (and vice versa), through part of our interop API in amp.h: *get_device, create_accelerator_view. More on those in a later post. Comments about this post by Daniel Moth welcome at the original blog.

Read the article
Visual Studio Exceptions dialogs

- by Daniel Moth

Previously I covered step 1 of live debugging with start and attach. Once the debugger is attached, you want to go to step 2 of live debugging, which is to break. One way to break under the debugger is to do nothing, and just wait for an exception to occur in your code. This is true for all types of code that you debug in Visual Studio, and let's consider the following piece of C# code:3: static void Main() 4: { 5: try 6: { 7: int i = 0; 8: int r = 5 / i; 9: } 10: catch (System.DivideByZeroException) {/*gulp. sue me.*/} 11: System.Console.ReadLine(); 12: } If you run this under the debugger do you expect an exception on line 8? It is a trick question: you have to know whether I have configured the debugger to break when exceptions are thrown (first-chance exceptions) or only when they are unhandled. The place you do that is in the Exceptions dialog which is accessible from the Debug->Exceptions menu and on my installation looks like this: Note that I have checked all CLR exceptions. I could have expanded (like shown for the C++ case in my screenshot) and selected specific exceptions. To read more about this dialog, please read the corresponding Exception Handling debugging msdn topic and all its subtopics. So, for the code above, the debugger will break execution due to the thrown exception (exactly as if the try..catch was not there), so I see the following Exception Thrown dialog: Note the following: I can hit continue (or hit break and then later continue) and the program will continue fine since I have a catch handler. If this was an unhandled exception, then that is what the dialog would say (instead of first chance exception) and continuing would crash the app. That hyperlinked text ("Open Exception Settings") opens the Exceptions dialog I described further up. The coolest thing to note is the checkbox - this is new in this latest release of Visual Studio: it is a shortcut to the checkbox in the Exceptions dialog, so you don't have to open it to change this setting for this specific exception - you can toggle that option right from this dialog. Finally, if you try the code above on your system, you may observe a couple of differences from my screenshots. The first is that you may have an additional column of checkboxes in the Exceptions dialog. The second is that the last dialog I shared may look different to you. It all depends on the Debug->Options settings, and the two relevant settings are in this screenshot: The Exception assistant is what configures the look of the UI when the debugger wants to indicate exception to you, and the Just My Code setting controls the extra column in the Exception dialog. You can read more about those options on MSDN: How to break on User-Unhandled exceptions (plus Gregg’s post) and Exception Assistant. Before I leave you to go play with this stuff a bit more, please note that this level of debugging is now available for JavaScript too, and if you are looking at the Exceptions dialog and wondering what the "GPU Memory Access Exceptions" node is about, stay tuned on the C++ AMP blog ;-) Comments about this post by Daniel Moth welcome at the original blog.

Read the article
Give a session on C++ AMP – here is how

- by Daniel Moth

Ever since presenting on C++ AMP at the AMD Fusion conference in June, then the Gamefest conference in August, and the BUILD conference in September, I've had numerous requests about my material from folks that want to re-deliver the same session. The C++ AMP session I put together has evolved over the 3 presentations to its final form that I used at BUILD, so that is the one I recommend you base yours on. Please get the slides and the recording from channel9 (I'll refer to slide numbers below). This is how I've been presenting the C++ AMP session: Context (slide 3, 04:18-08:18) Start with a demo, on my dual-GPU machine. I've been using the N-Body sample (for VS 11 Developer Preview). (slide 4) Use an nvidia slide that has additional examples of performance improvements that customers enjoy with heterogeneous computing. (slide 5) Talk a bit about the differences today between CPU and GPU hardware, leading to the fact that these will continue to co-exist and that GPUs are great for data parallel algorithms, but not much else today. One is a jack of all trades and the other is a number cruncher. (slide 6) Use the APU example from amd, as one indication that the hardware space is still in motion, emphasizing that the C++ AMP solution is a data parallel API, not a GPU API. It has a future proof design for hardware we have yet to see. (slide 7) Provide more meta-data, as blogged about when I first introduced C++ AMP. Code (slide 9-11) Introduce C++ AMP coding with a simplistic array-addition algorithm – the slides speak for themselves. (slide 12-13) index<N>, extent<N>, and grid<N>. (Slide 14-16) array<T,N>, array_view<T,N> and comparison between them. (Slide 17) parallel_for_each. (slide 18, 21) restrict. (slide 19-20) actual restrictions of restrict(direct3d) – the slides speak for themselves. (slide 22) bring it altogether with a matrix multiplication example. (slide 23-24) accelerator, and accelerator_view. (slide 26-29) Introduce tiling incl. tiled matrix multiplication [tiling probably deserves a whole session instead of 6 minutes!]. IDE (slide 34,37) Briefly touch on the concurrency visualizer. It supports GPU profiling, but enhancements specific to C++ AMP we hope will come at the Beta timeframe, which is when I'll be spending more time talking about it. (slide 35-36, 51:54-59:16) Demonstrate the GPU debugging experience in VS 11. Summary (slide 39) Re-iterate some of the points of slide 7, and add the point that the C++ AMP spec will be open for other compiler vendors to implement, even on other platforms (in fact, Microsoft is actively working on that). (slide 40) Links to content – see slide – including where all your questions should go: http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads. "But I don't have time for a full blown session, I only need 2 (or just 1, or 3) C++ AMP slides to use in my session on related topic X" If all you want is a small number of slides, you can take some from the session above and customize them. But because I am so nice, I have created some slides for you, including talking points in the notes section. Download them here. Comments about this post by Daniel Moth welcome at the original blog.

Read the article
What DX level does my graphics card support? Does it go to 11?

- by Daniel Moth

Recently I run into a situation that I have run into quite a few times. Someone encounters a machine and the question arises: "Is there a DirectX 11 card in this machine?". Typically the reason you are interested in that is because cards with DirectX 11 drivers fully support DirectCompute (and by extension C++ AMP) for GPGPU programming. The driver specifically is WDDM (1.1 on Windows 7 and Windows 8 introduces WDDM 1.2 with cool new capabilities). There are many ways for figuring out if you have a DirectX11 card, so here are the approaches that you can use, with a bonus right at the end of the post. Run DxDiag WindowsKey + R, type DxDiag and hit Enter. That is the DirectX diagnostic tool, which unfortunately, only tells you on the "System" tab what is the highest version of DirectX installed on your machine. So if it reports DirectX 11, that doesn't mean you have a DX11 driver! The "Display" tab has a promising "DDI version" label, but unfortunately that doesn't seem to be accurate on the machines I've tested it with (or I may be misinterpreting its use). Either way, this tool is not the one you want for this purpose, although it is good for telling you the WDDM version among other things. Use the Microsoft hardware page There is a Microsoft Windows 7 compatibility center, that lists all hardware (tip: use the advanced search) and you could try and locate your device there… good luck. Use Wikipedia or the hardware vendor's website Use the Wikipedia page for the vendor cards, for both nvidia and amd. Often this information will also be in the specifications for the cards on the IHV site, but is is nice that wikipedia has a single page per vendor that you can search etc. There is a column in the tables for API support where you can see the DirectX version. Check if it is one of these recommended DX11 cards You may not have a DirectX 11 card and are interested in purchasing one. While I am in no position to make recommendations, I will list here some cards from two big IHVs that we know are DirectX 11 capable. Some AMD (aka ATI) cards Low end, inexpensive DX11 hardware: Radeon 5450, 5550, 6450, 6570 Mid range (decent perf, single precision): Radeon 5750, 5770, 6770, 6790 High end (capable of double precision): Radeon 5850, 5870, 6950, 6970 Single precision APUs: AMD E-Series APUs AMD A-Series APUs Some NVIDIA cards Low end, inexpensive DX11 hardware: GeForce GT430, GT 440, GT520, GTS 450 Quadro 400, 600 Mid-range (decent perf, single precision): GeForce GTX 460, GTX 550 Ti, GTX 560, GTX 560 Ti Quadro 2000 High end (capable of double precision): GeForce GTX 480, GTX 570, GTX 580, GTX 590, GTX 595 Quadro 4000, 5000, 6000 Tesla C2050, C2070, C2075 Get the DirectX SDK and run DirectX Caps Viewer Download and install the June 2010 DirectX SDK. As part of that you now have the DirectX Capabilities Viewer utility (find it in your start menu by searching for "DirectX Caps Viewer", the filename is DXCapsViewer.exe). It will list all your devices (emulated, and real hardware ones) under the first node. Expand the hardware entries and then expand again the Direct3D 11 folder. If you see D3D_FEATURE_LEVEL_11_ under that, then your card supports feature level 11 which means it supports DirectCompute and C++ AMP. In the following screenshot of one of my old laptops, the card only goes to feature level 10. Run a utility from the web that just tells you! Of course, writing some C++ AMP code that enumerates accelerators and lists the ones that are capable is trivial. However that requires that you have redistributed the runtime, so a more broadly applicable approach is to use the DX APIs directly to enumerate the DX11 capable cards. That is exactly what the development lead for C++ AMP has done and he describes and shares that utility at this post. Comments about this post by Daniel Moth welcome at the original blog.

Read the article
What DX level does my graphics card support? Does it go to 11?

- by Daniel Moth

Recently I run into a situation that I have run into quite a few times. Someone encounters a machine and the question arises: "Is there a DirectX 11 card in this machine?". Typically the reason you are interested in that is because cards with DirectX 11 drivers fully support DirectCompute (and by extension C++ AMP) for GPGPU programming. The driver specifically is WDDM (1.1 on Windows 7 and Windows 8 introduces WDDM 1.2 with cool new capabilities). There are many ways for figuring out if you have a DirectX11 card, so here are the approaches that you can use, with a bonus right at the end of the post. Run DxDiag WindowsKey + R, type DxDiag and hit Enter. That is the DirectX diagnostic tool, which unfortunately, only tells you on the "System" tab what is the highest version of DirectX installed on your machine. So if it reports DirectX 11, that doesn't mean you have a DX11 driver! The "Display" tab has a promising "DDI version" label, but unfortunately that doesn't seem to be accurate on the machines I've tested it with (or I may be misinterpreting its use). Either way, this tool is not the one you want for this purpose, although it is good for telling you the WDDM version among other things. Use the Microsoft hardware page There is a Microsoft Windows 7 compatibility center, that lists all hardware (tip: use the advanced search) and you could try and locate your device there… good luck. Use Wikipedia or the hardware vendor's website Use the Wikipedia page for the vendor cards, for both nvidia and amd. Often this information will also be in the specifications for the cards on the IHV site, but is is nice that wikipedia has a single page per vendor that you can search etc. There is a column in the tables for API support where you can see the DirectX version. Check if it is one of these recommended DX11 cards You may not have a DirectX 11 card and are interested in purchasing one. While I am in no position to make recommendations, I will list here some cards from two big IHVs that we know are DirectX 11 capable. Some AMD (aka ATI) cards Low end, inexpensive DX11 hardware: Radeon 5450, 5550, 6450, 6570 Mid range (decent perf, single precision): Radeon 5750, 5770, 6770, 6790 High end (capable of double precision): Radeon 5850, 5870, 6950, 6970 Single precision APUs: AMD E-Series APUs AMD A-Series APUs Some NVIDIA cards Low end, inexpensive DX11 hardware: GeForce GT430, GT 440, GT520, GTS 450 Quadro 400, 600 Mid-range (decent perf, single precision): GeForce GTX 460, GTX 550 Ti, GTX 560, GTX 560 Ti Quadro 2000 High end (capable of double precision): GeForce GTX 480, GTX 570, GTX 580, GTX 590, GTX 595 Quadro 4000, 5000, 6000 Tesla C2050, C2070, C2075 Get the DirectX SDK and run DirectX Caps Viewer Download and install the June 2010 DirectX SDK. As part of that you now have the DirectX Capabilities Viewer utility (find it in your start menu by searching for "DirectX Caps Viewer", the filename is DXCapsViewer.exe). It will list all your devices (emulated, and real hardware ones) under the first node. Expand the hardware entries and then expand again the Direct3D 11 folder. If you see D3D_FEATURE_LEVEL_11_ under that, then your card supports feature level 11 which means it supports DirectCompute and C++ AMP. In the following screenshot of one of my old laptops, the card only goes to feature level 10. Run a utility from the web that just tells you! Of course, writing some C++ AMP code that enumerates accelerators and lists the ones that are capable is trivial. However that requires that you have redistributed the runtime, so a more broadly applicable approach is to use the DX APIs directly to enumerate the DX11 capable cards. That is exactly what the development lead for C++ AMP has done and he describes and shares that utility at this post. Comments about this post by Daniel Moth welcome at the original blog.

Read the article
Lead, Follow, or Get out of the way

- by Daniel Moth

This is one of the sayings (attributed to Thomas Paine) that totally resonated with me from the first time I heard it, which was only 3 years ago during some training course at work: "Lead, Follow, or Get out of the way" You'll find many books with this title and you'll find it quoted by politicians and other leaders in various countries at various times... the quote is open to interpretation and works on many levels. To set the tone of what this means to me, I'll use a simple micro example: In any given conversation, you are either leading it or following it, at different times/snapshots of the conversation. If you are not willing or able to lead it, and you are not willing or able to follow it, then you should depart. The bad alternative which this guidance encourages you NOT to do is to stick around and obstruct progress by not following, not leading, and simply complaining or trying to derail the discussion in no particular direction. The same pattern applies at your position/role at work. Either follow your management/leadership team, or try to lead them to what you think is a better place, or change jobs. Don't stick around complaining about the direction things are going, while not actively trying to either change things or make peace with it. In the previous paragraph you can replace the word "your management" with "the people reporting to you" and the guidance still holds. Either lead your direct reports to where you think they should go, or follow their lead, or change jobs. Complaining about folks not taking direction while doing nothing is not a maintainable state. To me this quote is not about a permanent state, it is not about some people always leading and some always following: It is about a role/hat that anybody can play/wear at any given moment. One minute I am leading you, the next I am following you, and the next we are both following someone else and so on... When there is disagreement, debate the different directions for as long as it takes for you to be comfortable that you can either follow or lead. If you don't become comfortable with either of those, get out of the way. Something to remember is that it is impossible to learn how to lead well, without learning how to follow well (probably deserves its own blog entry)... Things go wrong when someone thinks that they must always be leading, or when everybody wants to follow and nobody steps up to lead... Things go wrong when more than one person wants to lead and they don't try to reach agreement on a shared direction, stubbornly sticking to their guns pulling the rest of the team in multiple directions... Things go wrong when more than one person wants to lead and after numerous and lengthy discussions, none of them decides to follow or get out of the way... Things go wrong when people don't want to lead, don't want to follow, and insist on sticking around... While there are a few ways things that can go wrong as enumerated in the previous paragraph, the most common one in my experience is the last one I mentioned. You'll recognize these folks as the ones that always complain about everything that is wrong with their company/product but do nothing about it. Every time you hear someone giving feedback on how something is wrong or suboptimal, ask them "So now that you identified the problem, what do you think the solution is and what are you doing to drive us to that solution?" The next time things start going wrong, step up and remind everyone: Lead, Follow, or Get out of the way. For more perspectives, and for input to help you form your own interpretation, search the web for this phrase to see in what contexts it is being used (bing, google). Finally, regardless of your political views, I hope you can appreciate if only as an example this perspective of someone leading by actually getting out of the way. Comments about this post by Daniel Moth welcome at the original blog.

Read the article
Give a session on C++ AMP – here is how

- by Daniel Moth

Ever since presenting on C++ AMP at the AMD Fusion conference in June, then the Gamefest conference in August, and the BUILD conference in September, I've had numerous requests about my material from folks that want to re-deliver the same session. The C++ AMP session I put together has evolved over the 3 presentations to its final form that I used at BUILD, so that is the one I recommend you base yours on. Please get the slides and the recording from channel9 (I'll refer to slide numbers below). This is how I've been presenting the C++ AMP session: Context (slide 3, 04:18-08:18) Start with a demo, on my dual-GPU machine. I've been using the N-Body sample (for VS 11 Developer Preview). (slide 4) Use an nvidia slide that has additional examples of performance improvements that customers enjoy with heterogeneous computing. (slide 5) Talk a bit about the differences today between CPU and GPU hardware, leading to the fact that these will continue to co-exist and that GPUs are great for data parallel algorithms, but not much else today. One is a jack of all trades and the other is a number cruncher. (slide 6) Use the APU example from amd, as one indication that the hardware space is still in motion, emphasizing that the C++ AMP solution is a data parallel API, not a GPU API. It has a future proof design for hardware we have yet to see. (slide 7) Provide more meta-data, as blogged about when I first introduced C++ AMP. Code (slide 9-11) Introduce C++ AMP coding with a simplistic array-addition algorithm – the slides speak for themselves. (slide 12-13) index<N>, extent<N>, and grid<N>. (Slide 14-16) array<T,N>, array_view<T,N> and comparison between them. (Slide 17) parallel_for_each. (slide 18, 21) restrict. (slide 19-20) actual restrictions of restrict(direct3d) – the slides speak for themselves. (slide 22) bring it altogether with a matrix multiplication example. (slide 23-24) accelerator, and accelerator_view. (slide 26-29) Introduce tiling incl. tiled matrix multiplication [tiling probably deserves a whole session instead of 6 minutes!]. IDE (slide 34,37) Briefly touch on the concurrency visualizer. It supports GPU profiling, but enhancements specific to C++ AMP we hope will come at the Beta timeframe, which is when I'll be spending more time talking about it. (slide 35-36, 51:54-59:16) Demonstrate the GPU debugging experience in VS 11. Summary (slide 39) Re-iterate some of the points of slide 7, and add the point that the C++ AMP spec will be open for other compiler vendors to implement, even on other platforms (in fact, Microsoft is actively working on that). (slide 40) Links to content – see slide – including where all your questions should go: http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads. "But I don't have time for a full blown session, I only need 2 (or just 1, or 3) C++ AMP slides to use in my session on related topic X" If all you want is a small number of slides, you can take some from the session above and customize them. But because I am so nice, I have created some slides for you, including talking points in the notes section. Download them here. Comments about this post by Daniel Moth welcome at the original blog.

Read the article
concurrency::extent<N> from amp.h

- by Daniel Moth

Overview We saw in a previous post how index<N> represents a point in N-dimensional space and in this post we'll see how to define the N-dimensional space itself. With C++ AMP, an N-dimensional space can be specified with the template class extent<N> where you define the size of each dimension. From a look and feel perspective, you'd expect the programmatic interface of a point type and size type to be similar (even though the concepts are different). Indeed, exactly like index<N>, extent<N> is essentially a coordinate vector of N integers ordered from most- to least- significant, BUT each integer represents the size for that dimension (and hence cannot be negative). So, if you read the description of index, you won't be surprised with the below description of extent<N> There is the rank field returning the value of N you passed as the template parameter. You can construct one extent from another (via the copy constructor or the assignment operator), you can construct it by passing an integer array, or via convenience constructor overloads for 1- 2- and 3- dimension extents. Note that the parameterless constructor creates an extent of the specified rank with all bounds initialized to 0. You can access the components of the extent through the subscript operator (passing it an integer). You can perform some arithmetic operations between extent objects through operator overloading, i.e. ==, !=, +=, -=, +, -. There are operator overloads so that you can perform operations between an extent and an integer: -- (pre- and post- decrement), ++ (pre- and post- increment), %=, *=, /=, +=, –= and, finally, there are additional overloads for plus and minus (+,-) between extent<N> and index<N> objects, returning a new extent object as the result. In addition to the usual suspects, extent offers a contains function that tests if an index is within the bounds of the extent (assuming an origin of zero). It also has a size function that returns the total linear size of this extent<N> in units of elements. Example code extent<2> e(3, 4); _ASSERT(e.rank == 2); _ASSERT(e.size() == 3 * 4); e += 3; e[1] += 6; e = e + index<2>(3,-4); _ASSERT(e == extent<2>(9, 9)); _ASSERT( e.contains(index<2>(8, 8))); _ASSERT(!e.contains(index<2>(8, 9))); grid<N> Our upcoming pre-release bits also have a similar type to extent, grid<N>. The way you create a grid is by passing it an extent, e.g. extent<3> e(4,2,6); grid<3> g(e); I am not going to dive deeper into grid, suffice for now to think of grid<N> simply as an alias for the extent<N> object, that you create when you encounter a function that expects a grid object instead of an extent object. Usage The extent class on its own simply defines the size of the N-dimensional space. We'll see in future posts that when you create containers (arrays) and wrappers (array_views) for your data, it is an extent<N> object that you'll need to use to create those (and use an index<N> object to index into them). We'll also see that it is a grid<N> object that you pass to the new parallel_for_each function that I'll cover in the next post. Comments about this post by Daniel Moth welcome at the original blog.

Read the article
The way I think about Diagnostic tools

- by Daniel Moth

Every software has issues, or as we like to call them "bugs". That is not a discussion point, just a mere fact. It follows that an important skill for developers is to be able to diagnose issues in their code. Of course we need to advance our tools and techniques so we can prevent bugs getting into the code (e.g. unit testing), but beyond designing great software, diagnosing bugs is an equally important skill. To diagnose issues, the most important assets are good techniques, skill, experience, and maybe talent. What also helps is having good diagnostic tools and what helps further is knowing all the features that they offer and how to use them. The following classification is how I like to think of diagnostics. Note that like with any attempt to bucketize anything, you run into overlapping areas and blurry lines. Nevertheless, I will continue sharing my generalizations ;-) It is important to identify at the outset if you are dealing with a performance or a correctness issue. If you have a performance issue, use a profiler. I hear people saying "I am using the debugger to debug a performance issue", and that is fine, but do know that a dedicated profiler is the tool for that job. Just because you don't need them all the time and typically they cost more plus you are not as familiar with them as you are with the debugger, doesn't mean you shouldn't invest in one and instead try to exclusively use the wrong tool for the job. Visual Studio has a profiler and a concurrency visualizer (for profiling multi-threaded apps). If you have a correctness issue, then you have several options - that's next :-) This is how I think of identifying a correctness issue Do you want a tool to find the issue for you at design time? The compiler is such a tool - it gives you an exact list of errors. Compilers now also offer warnings, which is their way of saying "this may be an error, but I am not smart enough to know for sure". There are also static analysis tools, which go a step further than the compiler in identifying issues in your code, sometimes with the aid of code annotations and other times just by pointing them at your raw source. An example is FxCop and much more in Visual Studio 11 Code Analysis. Do you want a tool to find the issue for you with code execution? Just like static tools, there are also dynamic analysis tools that instead of statically analyzing your code, they analyze what your code does dynamically at runtime. Whether you have to setup some unit tests to invoke your code at runtime, or have to manually run your app (and interact with it) under the tool, or have to use a script to execute your binary under the tool… that varies. The result is still a list of issues for you to address after the analysis is complete or a pause of the execution when the first issue is encountered. If a code path was not taken, no analysis for it will exist, obviously. An example is the GPU Race detection tool that I'll be talking about on the C++ AMP team blog. Another example is the MSR concurrency CHESS tool. Do you want you to find the issue at design time using a tool? Perform a code walkthrough on your own or with colleagues. There are code review tools that go beyond just diffing sources, and they help you with that aspect too. For example, there is a new one in Visual Studio 11 and searching with my favorite search engine yielded this article based on the Developer Preview. Do you want you to find the issue with code execution? Use a debugger - let’s break this down further next. This is how I think of debugging: There is post mortem debugging. That means your code has executed and you did something in order to examine what happened during its execution. This can vary from manual printf and other tracing statements to trace events (e.g. ETW) to taking dumps. In all cases, you are left with some artifact that you examine after the fact (after code execution) to discern what took place hoping it will help you find the bug. Learn how to debug dump files in Visual Studio. There is live debugging. I will elaborate on this in a separate post, but this is where you inspect the state of your program during its execution, and try to find what the problem is. More from me in a separate post on live debugging. There is a hybrid of live plus post-mortem debugging. This is for example what tools like IntelliTrace offer. If you are a tools vendor interested in the diagnostics space, it helps to understand where in the above classification your tool excels, where its primary strength is, so you can market it as such. Then it helps to see which of the other areas above your tool touches on, and how you can make it even better there. Finally, see what areas your tool doesn't help at all with, and evaluate whether it should or continue to stay clear. Even though the classification helps us think about this space, the reality is that the best tools are either extremely excellent in only one of this areas, or more often very good across a number of them. Another approach is to offer a toolset covering all areas, with appropriate integration and hand off points from one to the other. Anyway, with that brain dump out of the way, in follow-up posts I will dive into live debugging, and specifically live debugging in Visual Studio - stay tuned if that interests you. Comments about this post by Daniel Moth welcome at the original blog.

Read the article
concurrency::accelerator

- by Daniel Moth

Overview An accelerator represents a "target" on which C++ AMP code can execute and where data can reside. Typically (but not necessarily) an accelerator is a GPU device. Accelerators are represented in C++ AMP as objects of the accelerator class. For many scenarios, you do not need to obtain an accelerator object, since the runtime has a notion of a default accelerator, which is what it thinks is the best one in the system. Examples where you need to deal with accelerator objects are if you need to pick your own accelerator (based on your specific criteria), or if you need to use more than one accelerators from your app. Construction and operator usage You can query and obtain a std::vector of all the accelerators on your system, which the runtime discovers on startup. Beyond enumerating accelerators, you can also create one directly by passing to the constructor a system-wide unique path to a device if you know it (i.e. the “Device Instance Path” property for the device in Device Manager), e.g. accelerator acc(L"PCI\\VEN_1002&DEV_6898&SUBSYS_0B001002etc"); There are some predefined strings (for predefined accelerators) that you can pass to the accelerator constructor (and there are corresponding constants for those on the accelerator class itself, so you don’t have to hardcode them every time). Examples are the following: accelerator::default_accelerator represents the default accelerator that the C++ AMP runtime picks for you if you don’t pick one (the heuristics of how it picks one will be covered in a future post). Example: accelerator acc; accelerator::direct3d_ref represents the reference rasterizer emulator that simulates a direct3d device on the CPU (in a very slow manner). This emulator is available on systems with Visual Studio installed and is useful for debugging. More on debugging in general in future posts. Example: accelerator acc(accelerator::direct3d_ref); accelerator::direct3d_warp represents a target that I will cover in future blog posts. Example: accelerator acc(accelerator::direct3d_warp); accelerator::cpu_accelerator represents the CPU. In this first release the only use of this accelerator is for using the staging arrays technique that I'll cover separately. Example: accelerator acc(accelerator::cpu_accelerator); You can also create an accelerator by shallow copying another accelerator instance (via the corresponding constructor) or simply assigning it to another accelerator instance (via the operator overloading of =). Speaking of operator overloading, you can also compare (for equality and inequality) two accelerator objects between them to determine if they refer to the same underlying device. Querying accelerator characteristics Given an accelerator object, you can access its description, version, device path, size of dedicated memory in KB, whether it is some kind of emulator, whether it has a display attached, whether it supports double precision, and whether it was created with the debugging layer enabled for extensive error reporting. Below is example code that accesses some of the properties; in your real code you'd probably be checking one or more of them in order to pick an accelerator (or check that the default one is good enough for your specific workload): void inspect_accelerator(concurrency::accelerator acc) { std::wcout << "New accelerator: " << acc.description << std::endl; std::wcout << "is_debug = " << acc.is_debug << std::endl; std::wcout << "is_emulated = " << acc.is_emulated << std::endl; std::wcout << "dedicated_memory = " << acc.dedicated_memory << std::endl; std::wcout << "device_path = " << acc.device_path << std::endl; std::wcout << "has_display = " << acc.has_display << std::endl; std::wcout << "version = " << (acc.version >> 16) << '.' << (acc.version & 0xFFFF) << std::endl; } accelerator_view In my next blog post I'll cover a related class: accelerator_view. Suffice to say here that each accelerator may have from 1..n related accelerator_view objects. You can get the accelerator_view from an accelerator via the default_view property, or create new ones by invoking the create_view method that creates an accelerator_view object for you (by also accepting a queuing_mode enum value of deferred or immediate that we'll also explore in the next blog post). Comments about this post by Daniel Moth welcome at the original blog.

Read the article
Sending Outlook Invites

- by Daniel Moth

Sending an Outlook invite for a meeting (also referred to as S+ in Microsoft) is a simple thing to get right if you just run the quick mental check below, which is driven by visual cues in the Outlook UI. I know that some folks don’t do this often or are new to Outlook, so if you know one of those folks share this blog post with them and if they read nothing else ask them to read step 7. Add on the To line the folks that you want to be at the meeting. Indicate optional invitees. Click on the “To” button to bring up the dialog that lets you move folks to be Optional (you can also do this from the Scheduling Assistant). Set the Reminder according to the attendee that has to travel the most. 5 minutes is the minimum. Use the Response Options and uncheck the "Request Response" if your event is going ahead regardless of who can make it or not, i.e. if everyone is optional. Don’t force every recipient to make an extra click, instead make the extra click yourself - you are the organizer. Add a good subject Make the subject such that just by reading it folks know what the meeting is about. Examples, e.g. "Review…", "Finalize…", "XYZ sync up" If this is only between two people and what is commonly referred to as a one to one, the subject would be something like "MyName/YourName 1:1" Write the subject in such a way that when the recipient sees this on their calendar among all the other items, they know what this meeting is about without having to see location, recipients, or any other information about the invite. Add a location, typically a meeting room. If recipients are from different buildings, schedule it where the folks that are doing the other folks a favor live. Otherwise schedule it wherever the least amount of people will have to travel. If you send me an invite to come to your building, and there is more of us than you, you are silently sending me the message that you are doing me a favor so if you don’t want to do that, include a note of why this is in your building, e.g. "Sorry we are slammed with back to back meetings today so hope you can come over to our building". If this is in someone's office, the location would be something like "Moth's office (7/666)" where in parenthesis you see the office location. If some folks are remote in another building/country, or if you know you picked a time which wasn't free for everyone, add an Online option (click the Lync Meeting button). Add a date and time. This MUST be at a time that is showing on the recipients’ calendar as FREE or at worst TENTATIVE. You can check that on the Scheduling Assistant. The reality is that this is not always possible, so in that case you MUST say something about it in the Invite Body, e.g. "Sorry I can see X has a conflict, but I cannot find a better slot", or "With so many of us there are some conflicts and I cannot find a better slot so hope this works", or "Apologies but due to Y we must have this meeting at this time and I know there are some conflicts, hope you can make it anyway". When you do that, I better not be able to find a better slot myself for all of us, and of course when you do that you have implicitly designated the Busy folks as optional. Finally, the body of the invite. This has the agenda of the meeting and if applicable the courtesy apologies due to messing up steps 6 & 7. This should not be the introduction to the meeting, in other words the recipients should not be surprised when they see the invite and go to the body to read it. Notifying them of the meeting takes place via separate email where you explain the purpose and give them a heads up that you'll be sending an invite. That separate email is also your chance to attach documents, don’t do that as part of the invite. TIP: If you have sent mail about the meeting, you can then go to your sent folder to select the message and click the "Meeting" button (Ctrl+Alt+R). This will populate the body with the necessary background, auto select the mandatory and optional attendees as per the TO/CC line, and have a subject that may be good enough already (or you can tweak it). Long to write, but very quick to remember and enforce since most of it is common sense and the checklist is driven of the visual cues in the UI you use to send the invite. Comments about this post by Daniel Moth welcome at the original blog.

Read the article
concurrency::index<N> from amp.h

- by Daniel Moth

Overview C++ AMP introduces a new template class index<N>, where N can be any value greater than zero, that represents a unique point in N-dimensional space, e.g. if N=2 then an index<2> object represents a point in 2-dimensional space. This class is essentially a coordinate vector of N integers representing a position in space relative to the origin of that space. It is ordered from most-significant to least-significant (so, if the 2-dimensional space is rows and columns, the first component represents the rows). The underlying type is a signed 32-bit integer, and component values can be negative. The rank field returns N. Creating an index The default parameterless constructor returns an index with each dimension set to zero, e.g. index<3> idx; //represents point (0,0,0) An index can also be created from another index through the copy constructor or assignment, e.g. index<3> idx2(idx); //or index<3> idx2 = idx; To create an index representing something other than 0, you call its constructor as per the following 4-dimensional example: int temp[4] = {2,4,-2,0}; index<4> idx(temp); Note that there are convenience constructors (that don’t require an array argument) for creating index objects of rank 1, 2, and 3, since those are the most common dimensions used, e.g. index<1> idx(3); index<2> idx(3, 6); index<3> idx(3, 6, 12); Accessing the component values You can access each component using the familiar subscript operator, e.g. One-dimensional example: index<1> idx(4); int i = idx[0]; // i=4 Two-dimensional example: index<2> idx(4,5); int i = idx[0]; // i=4 int j = idx[1]; // j=5 Three-dimensional example: index<3> idx(4,5,6); int i = idx[0]; // i=4 int j = idx[1]; // j=5 int k = idx[2]; // k=6 Basic operations Once you have your multi-dimensional point represented in the index, you can now treat it as a single entity, including performing common operations between it and an integer (through operator overloading): -- (pre- and post- decrement), ++ (pre- and post- increment), %=, *=, /=, +=, -=,%, *, /, +, -. There are also operator overloads for operations between index objects, i.e. ==, !=, +=, -=, +, –. Here is an example (where no assertions are broken): index<2> idx_a; index<2> idx_b(0, 0); index<2> idx_c(6, 9); _ASSERT(idx_a.rank == 2); _ASSERT(idx_a == idx_b); _ASSERT(idx_a != idx_c); idx_a += 5; idx_a[1] += 3; idx_a++; _ASSERT(idx_a != idx_b); _ASSERT(idx_a == idx_c); idx_b = idx_b + 10; idx_b -= index<2>(4, 1); _ASSERT(idx_a == idx_b); Usage You'll most commonly use index<N> objects to index into data types that we'll cover in future posts (namely array and array_view). Also when we look at the new parallel_for_each function we'll see that an index<N> object is the single parameter to the lambda, representing the (multi-dimensional) thread index… In the next post we'll go beyond being able to represent an N-dimensional point in space, and we'll see how to define the N-dimensional space itself through the extent<N> class. Comments about this post by Daniel Moth welcome at the original blog.

Read the article
tile_static, tile_barrier, and tiled matrix multiplication with C++ AMP

- by Daniel Moth

We ended the previous post with a mechanical transformation of the C++ AMP matrix multiplication example to the tiled model and in the process introduced tiled_index and tiled_grid. This is part 2. tile_static memory You all know that in regular CPU code, static variables have the same value regardless of which thread accesses the static variable. This is in contrast with non-static local variables, where each thread has its own copy. Back to C++ AMP, the same rules apply and each thread has its own value for local variables in your lambda, whereas all threads see the same global memory, which is the data they have access to via the array and array_view. In addition, on an accelerator like the GPU, there is a programmable cache, a third kind of memory type if you'd like to think of it that way (some call it shared memory, others call it scratchpad memory). Variables stored in that memory share the same value for every thread in the same tile. So, when you use the tiled model, you can have variables where each thread in the same tile sees the same value for that variable, that threads from other tiles do not. The new storage class for local variables introduced for this purpose is called tile_static. You can only use tile_static in restrict(direct3d) functions, and only when explicitly using the tiled model. What this looks like in code should be no surprise, but here is a snippet to confirm your mental image, using a good old regular C array // each tile of threads has its own copy of locA, // shared among the threads of the tile tile_static float locA[16][16]; Note that tile_static variables are scoped and have the lifetime of the tile, and they cannot have constructors or destructors. tile_barrier In amp.h one of the types introduced is tile_barrier. You cannot construct this object yourself (although if you had one, you could use a copy constructor to create another one). So how do you get one of these? You get it, from a tiled_index object. Beyond the 4 properties returning index objects, tiled_index has another property, barrier, that returns a tile_barrier object. The tile_barrier class exposes a single member, the method wait. 15: // Given a tiled_index object named t_idx 16: t_idx.barrier.wait(); 17: // more code …in the code above, all threads in the tile will reach line 16 before a single one progresses to line 17. Note that all threads must be able to reach the barrier, i.e. if you had branchy code in such a way which meant that there is a chance that not all threads could reach line 16, then the code above would be illegal. Tiled Matrix Multiplication Example – part 2 So now that we added to our understanding the concepts of tile_static and tile_barrier, let me obfuscate rewrite the matrix multiplication code so that it takes advantage of tiling. Before you start reading this, I suggest you get a cup of your favorite non-alcoholic beverage to enjoy while you try to fully understand the code. 01: void MatrixMultiplyTiled(vector<float>& vC, const vector<float>& vA, const vector<float>& vB, int M, int N, int W) 02: { 03: static const int TS = 16; 04: array_view<const float,2> a(M, W, vA); 05: array_view<const float,2> b(W, N, vB); 06: array_view<writeonly<float>,2> c(M,N,vC); 07: parallel_for_each(c.grid.tile< TS, TS >(), 08: [=] (tiled_index< TS, TS> t_idx) restrict(direct3d) 09: { 10: int row = t_idx.local[0]; int col = t_idx.local[1]; 11: float sum = 0.0f; 12: for (int i = 0; i < W; i += TS) { 13: tile_static float locA[TS][TS], locB[TS][TS]; 14: locA[row][col] = a(t_idx.global[0], col + i); 15: locB[row][col] = b(row + i, t_idx.global[1]); 16: t_idx.barrier.wait(); 17: for (int k = 0; k < TS; k++) 18: sum += locA[row][k] * locB[k][col]; 19: t_idx.barrier.wait(); 20: } 21: c[t_idx.global] = sum; 22: }); 23: } Notice that all the code up to line 9 is the same as per the changes we made in part 1 of tiling introduction. If you squint, the body of the lambda itself preserves the original algorithm on lines 10, 11, and 17, 18, and 21. The difference being that those lines use new indexing and the tile_static arrays; the tile_static arrays are declared and initialized on the brand new lines 13-15. On those lines we copy from the global memory represented by the array_view objects (a and b), to the tile_static vanilla arrays (locA and locB) – we are copying enough to fit a tile. Because in the code that follows on line 18 we expect the data for this tile to be in the tile_static storage, we need to synchronize the threads within each tile with a barrier, which we do on line 16 (to avoid accessing uninitialized memory on line 18). We also need to synchronize the threads within a tile on line 19, again to avoid the race between lines 14, 15 (retrieving the next set of data for each tile and overwriting the previous set) and line 18 (not being done processing the previous set of data). Luckily, as part of the awesome C++ AMP debugger in Visual Studio there is an option that helps you find such races, but that is a story for another blog post another time. May I suggest reading the next section, and then coming back to re-read and walk through this code with pen and paper to really grok what is going on, if you haven't already? Cool. Why would I introduce this tiling complexity into my code? Funny you should ask that, I was just about to tell you. There is only one reason we tiled our extent, had to deal with finding a good tile size, ensure the number of threads we schedule are correctly divisible with the tile size, had to use a tiled_index instead of a normal index, and had to understand tile_barrier and to figure out where we need to use it, and double the size of our lambda in terms of lines of code: the reason is to be able to use tile_static memory. Why do we want to use tile_static memory? Because accessing tile_static memory is around 10 times faster than accessing the global memory on an accelerator like the GPU, e.g. in the code above, if you can get 150GB/second accessing data from the array_view a, you can get 1500GB/second accessing the tile_static array locA. And since by definition you are dealing with really large data sets, the savings really pay off. We have seen tiled implementations being twice as fast as their non-tiled counterparts. Now, some algorithms will not have performance benefits from tiling (and in fact may deteriorate), e.g. algorithms that require you to go only once to global memory will not benefit from tiling, since with tiling you already have to fetch the data once from global memory! Other algorithms may benefit, but you may decide that you are happy with your code being 150 times faster than the serial-version you had, and you do not need to invest to make it 250 times faster. Also algorithms with more than 3 dimensions, which C++ AMP supports in the non-tiled model, cannot be tiled. Also note that in future releases, we may invest in making the non-tiled model, which already uses tiling under the covers, go the extra step and use tile_static memory on your behalf, but it is obviously way to early to commit to anything like that, and we certainly don't do any of that today. Comments about this post by Daniel Moth welcome at the original blog.

Read the article
Scheduling thread tiles with C++ AMP

- by Daniel Moth

This post assumes you are totally comfortable with, what some of us call, the simple model of C++ AMP, i.e. you could write your own matrix multiplication. We are now ready to explore the tiled model, which builds on top of the non-tiled one. Tiling the extent We know that when we pass a grid (which is just an extent under the covers) to the parallel_for_each call, it determines the number of threads to schedule and their index values (including dimensionality). For the single-, two-, and three- dimensional cases you can go a step further and subdivide the threads into what we call tiles of threads (others may call them thread groups). So here is a single-dimensional example: extent<1> e(20); // 20 units in a single dimension with indices from 0-19 grid<1> g(e); // same as extent tiled_grid<4> tg = g.tile<4>(); …on the 3rd line we subdivided the single-dimensional space into 5 single-dimensional tiles each having 4 elements, and we captured that result in a concurrency::tiled_grid (a new class in amp.h). Let's move on swiftly to another example, in pictures, this time 2-dimensional: So we start on the left with a grid of a 2-dimensional extent which has 8*6=48 threads. We then have two different examples of tiling. In the first case, in the middle, we subdivide the 48 threads into tiles where each has 4*3=12 threads, hence we have 2*2=4 tiles. In the second example, on the right, we subdivide the original input into tiles where each has 2*2=4 threads, hence we have 4*3=12 tiles. Notice how you can play with the tile size and achieve different number of tiles. The numbers you pick must be such that the original total number of threads (in our example 48), remains the same, and every tile must have the same size. Of course, you still have no clue why you would do that, but stick with me. First, we should see how we can use this tiled_grid, since the parallel_for_each function that we know expects a grid. Tiled parallel_for_each and tiled_index It turns out that we have additional overloads of parallel_for_each that accept a tiled_grid instead of a grid. However, those overloads, also expect that the lambda you pass in accepts a concurrency::tiled_index (new in amp.h), not an index<N>. So how is a tiled_index different to an index? A tiled_index object, can have only 1 or 2 or 3 dimensions (matching exactly the tiled_grid), and consists of 4 index objects that are accessible via properties: global, local, tile_origin, and tile. The global index is the same as the index we know and love: the global thread ID. The local index is the local thread ID within the tile. The tile_origin index returns the global index of the thread that is at position 0,0 of this tile, and the tile index is the position of the tile in relation to the overall grid. Confused? Here is an example accompanied by a picture that hopefully clarifies things: array_view<int, 2> data(8, 6, p_my_data); parallel_for_each(data.grid.tile<2,2>(), [=] (tiled_index<2,2> t_idx) restrict(direct3d) { /* todo */ }); Given the code above and the picture on the right, what are the values of each of the 4 index objects that the t_idx variables exposes, when the lambda is executed by T (highlighted in the picture on the right)? If you can't work it out yourselves, the solution follows: t_idx.global = index<2> (6,3) t_idx.local = index<2> (0,1) t_idx.tile_origin = index<2> (6,2) t_idx.tile = index<2> (3,1) Don't move on until you are comfortable with this… the picture really helps, so use it. Tiled Matrix Multiplication Example – part 1 Let's paste here the C++ AMP matrix multiplication example, bolding the lines we are going to change (can you guess what the changes will be?) 01: void MatrixMultiplyTiled_Part1(vector<float>& vC, const vector<float>& vA, const vector<float>& vB, int M, int N, int W) 02: { 03: 04: array_view<const float,2> a(M, W, vA); 05: array_view<const float,2> b(W, N, vB); 06: array_view<writeonly<float>,2> c(M, N, vC); 07: parallel_for_each(c.grid, 08: [=](index<2> idx) restrict(direct3d) { 09: 10: int row = idx[0]; int col = idx[1]; 11: float sum = 0.0f; 12: for(int i = 0; i < W; i++) 13: sum += a(row, i) * b(i, col); 14: c[idx] = sum; 15: }); 16: } To turn this into a tiled example, first we need to decide our tile size. Let's say we want each tile to be 16*16 (which assumes that we'll have at least 256 threads to process, and that c.grid.extent.size() is divisible by 256, and moreover that c.grid.extent[0] and c.grid.extent[1] are divisible by 16). So we insert at line 03 the tile size (which must be a compile time constant). 03: static const int TS = 16; ...then we need to tile the grid to have tiles where each one has 16*16 threads, so we change line 07 to be as follows 07: parallel_for_each(c.grid.tile<TS,TS>(), ...that means that our index now has to be a tiled_index with the same characteristics as the tiled_grid, so we change line 08 08: [=](tiled_index<TS, TS> t_idx) restrict(direct3d) { ...which means, without changing our core algorithm, we need to be using the global index that the tiled_index gives us access to, so we insert line 09 as follows 09: index<2> idx = t_idx.global; ...and now this code just works and it is tiled! Closing thoughts on part 1 The process we followed just shows the mechanical transformation that can take place from the simple model to the tiled model (think of this as step 1). In fact, when we wrote the matrix multiplication example originally, the compiler was doing this mechanical transformation under the covers for us (and it has additional smarts to deal with the cases where the total number of threads scheduled cannot be divisible by the tile size). The point is that the thread scheduling is always tiled, even when you use the non-tiled model. But with this mechanical transformation, we haven't gained anything… Hint: our goal with explicitly using the tiled model is to gain even more performance. In the next post, we'll evolve this further (beyond what the compiler can automatically do for us, in this first release), so you can see the full usage of the tiled model and its benefits… Comments about this post by Daniel Moth welcome at the original blog.

Read the article
Effectiveness and Efficiency

- by Daniel Moth

In the professional environment, i.e. at work, I am always seeking personal growth and to be challenged. The result is that my assignments, my work list, my tasks, my goals, my commitments, my [insert whatever word resonates with you] keep growing (in scope and desired impact). Which in turn means I have to keep finding new ways to deliver more value, while not falling into the trap of working more hours. To do that I continuously evaluate both my effectiveness and my efficiency. EFFECTIVENESS The first thing I check is my effectiveness: Am I doing the right things? Am I focusing too much on unimportant things? Am I spending more time doing stuff that is important to my team/org/division/business/company, or am I spending it on stuff that is important to me and that I enjoy doing? Am I valuing activities that maybe I have outgrown and should be delegated to others who are at a stage I have surpassed (in Microsoft speak: is the work I am doing level appropriate or am I still operating at the previous level)? Notice how the answers to those questions change over time and due to certain events, so I have to remind myself to revisit them frequently. Events that force me to re-examine them are: change of role, change of team/org/etc, change of direction of team/org/etc, re-org, new hires on the team that take on some of the work I did, personal promotion, change of manager... and if none of those events has occurred since the last annual review, I ask myself those at each annual review anyway. If you think you are not being effective at work, make a list of the stuff that you do and start tracking where your time goes. In parallel, have a discussion with your manager about where they think your time should go. Ultimately your time is finite and hence it is your most precious investment, don't waste it. If your management doesn't value as highly what you spend your time on, then either convince your management, or stop spending your time on it, or find different management: Lead, Follow, or get out of the way! That's my view on effectiveness. You have to fix that before moving to being efficient, or you may end up being very efficient at stuff that nobody wants you to be doing in the first place. For example, you may be spending your time writing blog posts and becoming better and faster at it all the time. If your manager thinks that is not even part of your job description, you are wasting your time to satisfy your inner desires. Nobody can help you with your effectiveness other than your management chain and your management peers - they are the judges of it. EFFICIENCY The second thing I check is my efficiency: Am I doing things right? For me, doing things right means that I deliver the same quality of work faster [than what I used to, and than my peers, and than expected of me]. The result is that I can achieve more [than what I used to, and than my peers, and than expected of me]. Notice how the efficiency goal is a more portable one. If, by whatever criteria, you think you are the best at [insert your own skill here], this can change at two events: because you have new colleagues (who are potentially better than your older ones), and it can change with a change of manager (who has potentially higher expectations). That's about it. Once you are efficient at something, you carry that with you... All you need to really be doing here is, when taking on new kinds of work that you haven't done before, try a few approaches and devise a system so that you can become efficient at this new activity too... Just keep "collecting" stuff that you are efficient at. If you think you are not being efficient at something, break it down: What are the steps you take to complete that task? How long do you spend on each step? Talk to others about what steps they take, to see if you can optimize some steps away or trade them for better steps, or just learn how to complete a step faster. Have a system for every task you take so that you can have repeatable success. That's my view on efficiency. You have to fix it so that you can free up time to do more. When you plan a route from A to B - all else being equal - you try to get there as fast as possible so why would you not want to do that with your everyday work? For example, imagine you are inefficient at processing email: You spend more time than necessary dealing with email, and you still end up with dropped email threads and with slower response times than others. How can you improve? Talk to someone that you think is good at this, understand their system (e.g. here is my email processing system) and come up with one that works for you. Parting Thoughts Are you considered, by your colleagues and manager, an effective and efficient person at your workplace? If you are, what would you change if you were asked by your management to do the job of two people? Seriously, think about that! Your immediate reaction may be "that is not possible", but it actually is. You just have to re-assess what things that were previously important will now stop being important, by discussing them with your management and reaching agreement on relative priorities. For example, stuff that was previously on your plate may now have to be delegated or dropped. Where you thought you were efficient, maybe now you have to find an even faster path to completion, perhaps keeping in mind that Perfect is the Enemy of “Good Enough”. My personal experience (from both observing others and from my own reflection) is that when folks are struggling to keep up at work it is because of two reasons: They are investing energy in stuff that they enjoy doing which the business regards as having a lower priority than a lot of other things on their plate. They are completing tasks to a level of higher quality than what is required (due to personal pride) missing the big picture which almost always mandates completing three tasks at good enough quality than knocking only one of them out of the park while the other two come in late or not at all. There is a lot of content on the web, so I strongly encourage you to use your favorite search engine to read other views on effectiveness and efficiency (Bing, Google). Comments about this post by Daniel Moth welcome at the original blog.

Read the article
parallel_for_each from amp.h – part 1

- by Daniel Moth

This posts assumes that you've read my other C++ AMP posts on index<N> and extent<N>, as well as about the restrict modifier. It also assumes you are familiar with C++ lambdas (if not, follow my links to C++ documentation). Basic structure and parameters Now we are ready for part 1 of the description of the new overload for the concurrency::parallel_for_each function. The basic new parallel_for_each method signature returns void and accepts two parameters: a grid<N> (think of it as an alias to extent) a restrict(direct3d) lambda, whose signature is such that it returns void and accepts an index of the same rank as the grid So it looks something like this (with generous returns for more palatable formatting) assuming we are dealing with a 2-dimensional space: // some_code_A parallel_for_each( g, // g is of type grid<2> [ ](index<2> idx) restrict(direct3d) { // kernel code } ); // some_code_B The parallel_for_each will execute the body of the lambda (which must have the restrict modifier), on the GPU. We also call the lambda body the "kernel". The kernel will be executed multiple times, once per scheduled GPU thread. The only difference in each execution is the value of the index object (aka as the GPU thread ID in this context) that gets passed to your kernel code. The number of GPU threads (and the values of each index) is determined by the grid object you pass, as described next. You know that grid is simply a wrapper on extent. In this context, one way to think about it is that the extent generates a number of index objects. So for the example above, if your grid was setup by some_code_A as follows: extent<2> e(2,3); grid<2> g(e); ...then given that: e.size()==6, e[0]==2, and e[1]=3 ...the six index<2> objects it generates (and hence the values that your lambda would receive) are: (0,0) (1,0) (0,1) (1,1) (0,2) (1,2) So what the above means is that the lambda body with the algorithm that you wrote will get executed 6 times and the index<2> object you receive each time will have one of the values just listed above (of course, each one will only appear once, the order is indeterminate, and they are likely to call your code at the same exact time). Obviously, in real GPU programming, you'd typically be scheduling thousands if not millions of threads, not just 6. If you've been following along you should be thinking: "that is all fine and makes sense, but what can I do in the kernel since I passed nothing else meaningful to it, and it is not returning any values out to me?" Passing data in and out It is a good question, and in data parallel algorithms indeed you typically want to pass some data in, perform some operation, and then typically return some results out. The way you pass data into the kernel, is by capturing variables in the lambda (again, if you are not familiar with them, follow the links about C++ lambdas), and the way you use data after the kernel is done executing is simply by using those same variables. In the example above, the lambda was written in a fairly useless way with an empty capture list: [ ](index<2> idx) restrict(direct3d), where the empty square brackets means that no variables were captured. If instead I write it like this [&](index<2> idx) restrict(direct3d), then all variables in the some_code_A region are made available to the lambda by reference, but as soon as I try to use any of those variables in the lambda, I will receive a compiler error. This has to do with one of the direct3d restrictions, where only one type can be capture by reference: objects of the new concurrency::array class that I'll introduce in the next post (suffice for now to think of it as a container of data). If I write the lambda line like this [=](index<2> idx) restrict(direct3d), all variables in the some_code_A region are made available to the lambda by value. This works for some types (e.g. an integer), but not for all, as per the restrictions for direct3d. In particular, no useful data classes work except for one new type we introduce with C++ AMP: objects of the new concurrency::array_view class, that I'll introduce in the post after next. Also note that if you capture some variable by value, you could use it as input to your algorithm, but you wouldn’t be able to observe changes to it after the parallel_for_each call (e.g. in some_code_B region since it was passed by value) – the exception to this rule is the array_view since (as we'll see in a future post) it is a wrapper for data, not a container. Finally, for completeness, you can write your lambda, e.g. like this [av, &ar](index<2> idx) restrict(direct3d) where av is a variable of type array_view and ar is a variable of type array - the point being you can be very specific about what variables you capture and how. So it looks like from a large data perspective you can only capture array and array_view objects in the lambda (that is how you pass data to your kernel) and then use the many threads that call your code (each with a unique index) to perform some operation. You can also capture some limited types by value, as input only. When the last thread completes execution of your lambda, the data in the array_view or array are ready to be used in the some_code_B region. We'll talk more about all this in future posts… (a)synchronous Please note that the parallel_for_each executes as if synchronous to the calling code, but in reality, it is asynchronous. I.e. once the parallel_for_each call is made and the kernel has been passed to the runtime, the some_code_B region continues to execute immediately by the CPU thread, while in parallel the kernel is executed by the GPU threads. However, if you try to access the (array or array_view) data that you captured in the lambda in the some_code_B region, your code will block until the results become available. Hence the correct statement: the parallel_for_each is as-if synchronous in terms of visible side-effects, but asynchronous in reality. That's all for now, we'll revisit the parallel_for_each description, once we introduce properly array and array_view – coming next. Comments about this post by Daniel Moth welcome at the original blog.

Read the article

Search Results

Search found 1731 results on 70 pages for 'daniel rose'.

Page 3/70 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

- by Daniel Moth

< Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >