Search Results

Search found 1774 results on 71 pages for 'parallel for'.

Page 31/71 | < Previous Page | 27 28 29 30 31 32 33 34 35 36 37 38 | Next Page >

ODI 11g – Oracle Multi Table Insert

- by David Allan

With the IKM Oracle Multi Table Insert you can generate Oracle specific DML for inserting into multiple target tables from a single query result – without reprocessing the query or staging its result. When designing this to exploit the IKM you must split the problem into the reusable parts – the select part goes in one interface (I named SELECT_PART), then each target goes in a separate interface (INSERT_SPECIAL and INSERT_REGULAR). So for my statement below… /*INSERT_SPECIAL interface */ insert all when 1=1 And (INCOME_LEVEL > 250000) then into SCOTT.CUSTOMERS_NEW (ID, NAME, GENDER, BIRTH_DATE, MARITAL_STATUS, INCOME_LEVEL, CREDIT_LIMIT, EMAIL, USER_CREATED, DATE_CREATED, USER_MODIFIED, DATE_MODIFIED) values (ID, NAME, GENDER, BIRTH_DATE, MARITAL_STATUS, INCOME_LEVEL, CREDIT_LIMIT, EMAIL, USER_CREATED, DATE_CREATED, USER_MODIFIED, DATE_MODIFIED) /* INSERT_REGULAR interface */ when 1=1 then into SCOTT.CUSTOMERS_SPECIAL (ID, NAME, GENDER, BIRTH_DATE, MARITAL_STATUS, INCOME_LEVEL, CREDIT_LIMIT, EMAIL, USER_CREATED, DATE_CREATED, USER_MODIFIED, DATE_MODIFIED) values (ID, NAME, GENDER, BIRTH_DATE, MARITAL_STATUS, INCOME_LEVEL, CREDIT_LIMIT, EMAIL, USER_CREATED, DATE_CREATED, USER_MODIFIED, DATE_MODIFIED) /*SELECT*PART interface */ select CUSTOMERS.EMAIL EMAIL, CUSTOMERS.CREDIT_LIMIT CREDIT_LIMIT, UPPER(CUSTOMERS.NAME) NAME, CUSTOMERS.USER_MODIFIED USER_MODIFIED, CUSTOMERS.DATE_MODIFIED DATE_MODIFIED, CUSTOMERS.BIRTH_DATE BIRTH_DATE, CUSTOMERS.MARITAL_STATUS MARITAL_STATUS, CUSTOMERS.ID ID, CUSTOMERS.USER_CREATED USER_CREATED, CUSTOMERS.GENDER GENDER, CUSTOMERS.DATE_CREATED DATE_CREATED, CUSTOMERS.INCOME_LEVEL INCOME_LEVEL from SCOTT.CUSTOMERS CUSTOMERS where (1=1) Firstly I create a SELECT_PART temporary interface for the query to be reused and in the IKM assignment I state that it is defining the query, it is not a target and it should not be executed. Then in my INSERT_SPECIAL interface loading a target with a filter, I set define query to false, then set true for the target table and execute to false. This interface uses the SELECT_PART query definition interface as a source. Finally in my final interface loading another target I set define query to false again, set target table to true and execute to true – this is the go run it indicator! To coordinate the statement construction you will need to create a package with the select and insert statements. With 11g you can now execute the package in simulation mode and preview the generated code including the SQL statements. Hopefully this helps shed some light on how you can leverage the Oracle MTI statement. A similar IKM exists for Teradata. The ODI IKM Teradata Multi Statement supports this multi statement request in 11g, here is an extract from the paper at www.teradata.com/white-papers/born-to-be-parallel-eb3053/ Teradata Database offers an SQL extension called a Multi-Statement Request that allows several distinct SQL statements to be bundled together and sent to the optimizer as if they were one. Teradata Database will attempt to execute these SQL statements in parallel. When this feature is used, any sub-expressions that the different SQL statements have in common will be executed once, and the results shared among them. It works in the same way as the ODI MTI IKM, multiple interfaces orchestrated in a package, each interface contributes some SQL, the last interface in the chain executes the multi statement.

Read the article
F# and the rose-tinted reflection

- by CliveT

We're already seeing increasing use of many cores on client desktops. It is a change that has been long predicted. It is not just a change in architecture, but our notions of efficiency in a program. No longer can we focus on the asymptotic complexity of an algorithm by counting the steps that a single core processor would take to execute it. Instead we'll soon be more concerned about the scalability of the algorithm and how well we can increase the performance as we increase the number of cores. This may even lead us to throw away our most efficient algorithms, and switch to less efficient algorithms that scale better. We might even be willing to waste cycles in order to speculatively execute at the algorithm rather than the hardware level. State is the big headache in this parallel world. At the hardware level, main memory doesn't necessarily contain the definitive value corresponding to a particular address. An update to a location might still be held in a CPU's local cache and it might be some time before the value gets propagated. To get the latest value, and the notion of "latest" takes a lot of defining in this world of rapidly mutating state, the CPUs may well need to communicate to decide who has the definitive value of a particular address in order to avoid lost updates. At the user program level, this means programmers will need to lock objects before modifying them, or attempt to avoid the overhead of locking by understanding the memory models at a very deep level. I think it's this need to avoid statefulness that has led to the recent resurgence of interest in functional languages. In the 1980s, functional languages started getting traction when research was carried out into how programs in such languages could be auto-parallelised. Sadly, the impracticality of some of the languages, the overheads of communication during this parallel execution, and rapid improvements in compiler technology on stock hardware meant that the functional languages fell by the wayside. The one thing that these languages were good at was getting rid of implicit state, and this single idea seems like a solution to the problems we are going to face in the coming years. Whether these languages will catch on is hard to predict. The mindset for writing a program in a functional language is really very different from the way that object-oriented problem decomposition happens - one has to focus on the verbs instead of the nouns, which takes some getting used to. There are a number of hybrid functional/object languages that have been becoming more popular in recent times. These half-way houses make it easy to use functional ideas for some parts of the program while still allowing access to the underlying object-focused platform without a great deal of impedance mismatch. One example is F# running on the CLR which, in Visual Studio 2010, has because a first class member of the pack. Inside Visual Studio 2010, the tooling for F# has improved to the point where it is easy to set breakpoints and watch values change while debugging at the source level. In my opinion, it is the tooling support that will enable the widespread adoption of functional languages - without this support, people will put off any transition into the functional world for as long as they possibly can. Without tool support it will make it hard to learn these languages. One tool that doesn't currently support F# is Reflector. The idea of decompiling IL to a functional language is daunting, but F# is potentially so important I couldn't dismiss the idea. As I'm currently developing Reflector 6.5, I thought it wise to take four days just to see how far I could get in doing so, even if it achieved little more than to be clearer on how much was possible, and how long it might take. You can read what happened here, and of the insights it gave us on ways to improve the tool.

Read the article
Give a session on C++ AMP – here is how

- by Daniel Moth

Ever since presenting on C++ AMP at the AMD Fusion conference in June, then the Gamefest conference in August, and the BUILD conference in September, I've had numerous requests about my material from folks that want to re-deliver the same session. The C++ AMP session I put together has evolved over the 3 presentations to its final form that I used at BUILD, so that is the one I recommend you base yours on. Please get the slides and the recording from channel9 (I'll refer to slide numbers below). This is how I've been presenting the C++ AMP session: Context (slide 3, 04:18-08:18) Start with a demo, on my dual-GPU machine. I've been using the N-Body sample (for VS 11 Developer Preview). (slide 4) Use an nvidia slide that has additional examples of performance improvements that customers enjoy with heterogeneous computing. (slide 5) Talk a bit about the differences today between CPU and GPU hardware, leading to the fact that these will continue to co-exist and that GPUs are great for data parallel algorithms, but not much else today. One is a jack of all trades and the other is a number cruncher. (slide 6) Use the APU example from amd, as one indication that the hardware space is still in motion, emphasizing that the C++ AMP solution is a data parallel API, not a GPU API. It has a future proof design for hardware we have yet to see. (slide 7) Provide more meta-data, as blogged about when I first introduced C++ AMP. Code (slide 9-11) Introduce C++ AMP coding with a simplistic array-addition algorithm – the slides speak for themselves. (slide 12-13) index<N>, extent<N>, and grid<N>. (Slide 14-16) array<T,N>, array_view<T,N> and comparison between them. (Slide 17) parallel_for_each. (slide 18, 21) restrict. (slide 19-20) actual restrictions of restrict(direct3d) – the slides speak for themselves. (slide 22) bring it altogether with a matrix multiplication example. (slide 23-24) accelerator, and accelerator_view. (slide 26-29) Introduce tiling incl. tiled matrix multiplication [tiling probably deserves a whole session instead of 6 minutes!]. IDE (slide 34,37) Briefly touch on the concurrency visualizer. It supports GPU profiling, but enhancements specific to C++ AMP we hope will come at the Beta timeframe, which is when I'll be spending more time talking about it. (slide 35-36, 51:54-59:16) Demonstrate the GPU debugging experience in VS 11. Summary (slide 39) Re-iterate some of the points of slide 7, and add the point that the C++ AMP spec will be open for other compiler vendors to implement, even on other platforms (in fact, Microsoft is actively working on that). (slide 40) Links to content – see slide – including where all your questions should go: http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads. "But I don't have time for a full blown session, I only need 2 (or just 1, or 3) C++ AMP slides to use in my session on related topic X" If all you want is a small number of slides, you can take some from the session above and customize them. But because I am so nice, I have created some slides for you, including talking points in the notes section. Download them here. Comments about this post by Daniel Moth welcome at the original blog.

Read the article
SQL Azure: Notes on Building a Shard Technology

- by Herve Roggero

In Chapter 10 of the book on SQL Azure (http://www.apress.com/book/view/9781430229612) I am co-authoring, I am digging deeper in what it takes to write a Shard. It's actually a pretty cool exercise, and I wanted to share some thoughts on how I am designing the technology. A Shard is a technology that spreads the load of database requests over multiple databases, as transparently as possible. The type of shard I am building is called a Vertical Partition Shard (VPS). A VPS is a mechanism by which the data is stored in one or more databases behind the scenes, but your code has no idea at design time which data is in which database. It's like having a mini cloud for records instead of services. Imagine you have three SQL Azure databases that have the same schema (DB1, DB2 and DB3), you would like to issue a SELECT * FROM Users on all three databases, concatenate the results into a single resultset, and order by last name. Imagine you want to ensure your code doesn't need to change if you add a new database to the shard (DB4). Now imagine that you want to make sure all three databases are queried at the same time, in a multi-threaded manner so your code doesn't have to wait for three database calls sequentially. Then, imagine you would like to obtain a breadcrumb (in the form of a new, virtual column) that gives you a hint as to which database a record came from, so that you could update it if needed. Now imagine all that is done through the standard SqlClient library... and you have the Shard I am currently building. Here are some lessons learned and techniques I am using with this shard: Parellel Processing: Querying databases in parallel is not too hard using the Task Parallel Library; all you need is to lock your resources when needed Deleting/Updating Data: That's not too bad either as long as you have a breadcrumb. However it becomes more difficult if you need to update a single record and you don't know in which database it is. Inserting Data: I am using a round-robin approach in which each new insert request is directed to the next database in the shard. Not sure how to deal with Bulk Loads just yet... Shard Databases: I use a static collection of SqlConnection objects which needs to be loaded once; from there on all the Shard commands use this collection Extension Methods: In order to make it look like the Shard commands are part of the SqlClient class I use extension methods. For example I added ExecuteShardQuery and ExecuteShardNonQuery methods to SqlClient. Exceptions: Capturing exceptions in a multi-threaded code is interesting... but I kept it simple for now. I am using the ConcurrentQueue to store my exceptions. Database GUID: Every database in the shard is given a GUID, which is calculated based on the connection string's values. DataTable. The Shard methods return a DataTable object which can be bound to objects. I will be sharing the code soon as an open-source project in CodePlex. Please stay tuned on twitter to know when it will be available (@hroggero). Or check www.bluesyntax.net for updates on the shard. Thanks!

Read the article
Give a session on C++ AMP – here is how

- by Daniel Moth

Ever since presenting on C++ AMP at the AMD Fusion conference in June, then the Gamefest conference in August, and the BUILD conference in September, I've had numerous requests about my material from folks that want to re-deliver the same session. The C++ AMP session I put together has evolved over the 3 presentations to its final form that I used at BUILD, so that is the one I recommend you base yours on. Please get the slides and the recording from channel9 (I'll refer to slide numbers below). This is how I've been presenting the C++ AMP session: Context (slide 3, 04:18-08:18) Start with a demo, on my dual-GPU machine. I've been using the N-Body sample (for VS 11 Developer Preview). (slide 4) Use an nvidia slide that has additional examples of performance improvements that customers enjoy with heterogeneous computing. (slide 5) Talk a bit about the differences today between CPU and GPU hardware, leading to the fact that these will continue to co-exist and that GPUs are great for data parallel algorithms, but not much else today. One is a jack of all trades and the other is a number cruncher. (slide 6) Use the APU example from amd, as one indication that the hardware space is still in motion, emphasizing that the C++ AMP solution is a data parallel API, not a GPU API. It has a future proof design for hardware we have yet to see. (slide 7) Provide more meta-data, as blogged about when I first introduced C++ AMP. Code (slide 9-11) Introduce C++ AMP coding with a simplistic array-addition algorithm – the slides speak for themselves. (slide 12-13) index<N>, extent<N>, and grid<N>. (Slide 14-16) array<T,N>, array_view<T,N> and comparison between them. (Slide 17) parallel_for_each. (slide 18, 21) restrict. (slide 19-20) actual restrictions of restrict(direct3d) – the slides speak for themselves. (slide 22) bring it altogether with a matrix multiplication example. (slide 23-24) accelerator, and accelerator_view. (slide 26-29) Introduce tiling incl. tiled matrix multiplication [tiling probably deserves a whole session instead of 6 minutes!]. IDE (slide 34,37) Briefly touch on the concurrency visualizer. It supports GPU profiling, but enhancements specific to C++ AMP we hope will come at the Beta timeframe, which is when I'll be spending more time talking about it. (slide 35-36, 51:54-59:16) Demonstrate the GPU debugging experience in VS 11. Summary (slide 39) Re-iterate some of the points of slide 7, and add the point that the C++ AMP spec will be open for other compiler vendors to implement, even on other platforms (in fact, Microsoft is actively working on that). (slide 40) Links to content – see slide – including where all your questions should go: http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads. "But I don't have time for a full blown session, I only need 2 (or just 1, or 3) C++ AMP slides to use in my session on related topic X" If all you want is a small number of slides, you can take some from the session above and customize them. But because I am so nice, I have created some slides for you, including talking points in the notes section. Download them here. Comments about this post by Daniel Moth welcome at the original blog.

Read the article
SQL Server Optimizer Malfunction?

- by Tony Davis

There was a sharp intake of breath from the audience when Adam Machanic declared the SQL Server optimizer to be essentially "stuck in 1997". It was during his fascinating "Query Tuning Mastery: Manhandling Parallelism" session at the recent PASS SQL Summit. Paraphrasing somewhat, Adam (blog | @AdamMachanic) offered a convincing argument that the optimizer often delivers flawed plans based on assumptions that are no longer valid with today’s hardware. In 1997, when Microsoft engineers re-designed the database engine for SQL Server 7.0, SQL Server got its initial implementation of a cost-based optimizer. Up to SQL Server 2000, the developer often had to deploy a steady stream of hints in SQL statements to combat the occasionally wilful plan choices made by the optimizer. However, with each successive release, the optimizer has evolved and improved in its decision-making. It is still prone to the occasional stumble when we tackle difficult problems, join large numbers of tables, perform complex aggregations, and so on, but for most of us, most of the time, the optimizer purrs along efficiently in the background. Adam, however, challenged further any assumption that the current optimizer is competent at providing the most efficient plans for our more complex analytical queries, and in particular of offering up correctly parallelized plans. He painted a picture of a present where complex analytical queries have become ever more prevalent; where disk IO is ever faster so that reads from disk come into buffer cache faster than ever; where the improving RAM-to-data ratio means that we have a better chance of finding our data in cache. Most importantly, we have more CPUs at our disposal than ever before. To get these queries to perform, we not only need to have the right indexes, but also to be able to split the data up into subsets and spread its processing evenly across all these available CPUs. Improvements such as support for ColumnStore indexes are taking things in the right direction, but, unfortunately, deficiencies in the current Optimizer mean that SQL Server is yet to be able to exploit properly all those extra CPUs. Adam’s contention was that the current optimizer uses essentially the same costing model for many of its core operations as it did back in the days of SQL Server 7, based on assumptions that are no longer valid. One example he gave was a "slow disk" bias that may have been valid back in 1997 but certainly is not on modern disk systems. Essentially, the optimizer assesses the relative cost of serial versus parallel plans based on the assumption that there is no IO cost benefit from parallelization, only CPU. It assumes that a single request will saturate the IO channel, and so a query would not run any faster if we parallelized IO because the disk system simply wouldn’t be able to handle the extra pressure. As such, the optimizer often decides that a serial plan is lower cost, often in cases where a parallel plan would improve performance dramatically. It was challenging and thought provoking stuff, as were his techniques for driving parallelism through query logic based on subsets of rows that define the "grain" of the query. I highly recommend you catch the session if you missed it. I’m interested to hear though, when and how often people feel the force of the optimizer’s shortcomings. Barring mistakes, such as stale statistics, how often do you feel the Optimizer fails to find the plan you think it should, and what are the most common causes? Is it fighting to induce it toward parallelism? Combating unexpected plans, arising from table partitioning? Something altogether more prosaic? Cheers, Tony.

Read the article
About the K computer

- by nospam(at)example.com (Joerg Moellenkamp)

Okay ? after getting yet another mail because of the new #1 on the Top500 list, I want to add some comments from my side: Yes, the system is using SPARC processor. And that is great news for a SPARC fan like me. It is using the SPARC VIIIfx processor from Fujitsu clocked at 2 GHz. No, it isn't the only one. Most people are saying there are two in the Top500 list using SPARC (#77 JAXA and #1 K) but in fact there are three. The Tianhe-1 (#2 on the Top500 list) super computer contains 2048 Galaxy "FT-1000" 1 GHz 8-core processors. Don't know it? The FeiTeng-1000 ? this proc is a 8 core, 8 threads per core, 1 ghz processor made in China. And it's SPARC based. By the way ? this sounds really familiar to me ? perhaps the people just took the opensourced UltraSPARC-T2 design, because some of the parameters sound just to similar. However it looks like that Tianhe-1 is using the SPARCs as input nodes and not as compute notes. No, I don't see it as the next M-series processor. Simple reason: You can't create SMP systems out of them ? it simply hasn't the functionality to do so. Even when there are multiple CPUs on a single board, they are not connected like an SMP/NUMA machine to a shared memory machine ? they are connected with the cluster interconnect (in this case the Tofu interconnect) and work like a large cluster. Yes, it has a lot of oomph in Linpack ? however I assume a lot came from the extensions to the SPARCv9 standard. No, Linpack has no relevance for any commercial workload ? Linpack is such a special load, that even some HPC people are arguing that it isn't really a good benchmark for HPC. It's embarrassingly parallel, it can work with relatively small interconnects compared to the interconnects in SMP systems (however we get in spheres SMP interconnects where a few years ago). Amdahl isn't hitting that hard when running Linpack. Yes, it's a good move to use SPARC. At some time in the last 10 years, there was an interesting twist in perception: SPARC was considered as proprietary architecture and x86 was the open architecture. However it's vice versa ? try to create a x86 clone and you have a lot of intellectual property problems, create a SPARC clone and you have to spend 100 bucks or so to get the specification from the SPARC Foundation and develop your own SPARC processor. Fujitsu is doing this for a long time now. So they had their own processor, their own know-how. So why was SPARC a good choice? Well ? essentially Fujitsu can do what they want with their core as it is their core, for example adding the extensions to the SPARCv9 chipset ? getting Intel to create extensions to x86 to help you with your product is a little bit harder. So Fujitsu could do they needed to do with their processor in order to create such a supercomputer. No, the K is really using no FPGA or GPU as accelerators. The K is really using the CPU at doing this job. Yes, it has a significantly enhanced FPU capable to execute 8 instructions in parallel. No, it doesn't run Solaris. Yes, it uses Linux. No, it doesn't hurt me ... as my colleague Roland Rambau (he knows a lot about HPC) said once to me ... it doesn't matter which OS is staying out of the way of the workload in HPC.

Read the article
Faster Memory Allocation Using vmtasks

- by Steve Sistare

You may have noticed a new system process called "vmtasks" on Solaris 11 systems: % pgrep vmtasks 8 % prstat -p 8 PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP 8 root 0K 0K sleep 99 -20 9:10:59 0.0% vmtasks/32 What is vmtasks, and why should you care? In a nutshell, vmtasks accelerates creation, locking, and destruction of pages in shared memory segments. This is particularly helpful for locked memory, as creating a page of physical memory is much more expensive than creating a page of virtual memory. For example, an ISM segment (shmflag & SHM_SHARE_MMU) is locked in memory on the first shmat() call, and a DISM segment (shmflg & SHM_PAGEABLE) is locked using mlock() or memcntl(). Segment operations such as creation and locking are typically single threaded, performed by the thread making the system call. In many applications, the size of a shared memory segment is a large fraction of total physical memory, and the single-threaded initialization is a scalability bottleneck which increases application startup time. To break the bottleneck, we apply parallel processing, harnessing the power of the additional CPUs that are always present on modern platforms. For sufficiently large segments, as many of 16 threads of vmtasks are employed to assist an application thread during creation, locking, and destruction operations. The segment is implicitly divided at page boundaries, and each thread is given a chunk of pages to process. The per-page processing time can vary, so for dynamic load balancing, the number of chunks is greater than the number of threads, and threads grab chunks dynamically as they finish their work. Because the threads modify a single application address space in compressed time interval, contention on locks protecting VM data structures locks was a problem, and we had to re-scale a number of VM locks to get good parallel efficiency. The vmtasks process has 1 thread per CPU and may accelerate multiple segment operations simultaneously, but each operation gets at most 16 helper threads to avoid monopolizing CPU resources. We may reconsider this limit in the future. Acceleration using vmtasks is enabled out of the box, with no tuning required, and works for all Solaris platform architectures (SPARC sun4u, SPARC sun4v, x86). The following tables show the time to create + lock + destroy a large segment, normalized as milliseconds per gigabyte, before and after the introduction of vmtasks: ISM system ncpu before after speedup ------ ---- ------ ----- ------- x4600 32 1386 245 6X X7560 64 1016 153 7X M9000 512 1196 206 6X T5240 128 2506 234 11X T4-2 128 1197 107 11x DISM system ncpu before after speedup ------ ---- ------ ----- ------- x4600 32 1582 265 6X X7560 64 1116 158 7X M9000 512 1165 152 8X T5240 128 2796 198 14X (I am missing the data for T4 DISM, for no good reason; it works fine). The following table separates the creation and destruction times: ISM, T4-2 before after ------ ----- create 702 64 destroy 495 43 To put this in perspective, consider creating a 512 GB ISM segment on T4-2. Creating the segment would take 6 minutes with the old code, and only 33 seconds with the new. If this is your Oracle SGA, you save over 5 minutes when starting the database, and you also save when shutting it down prior to a restart. Those minutes go directly to your bottom line for service availability.

Read the article
Is Rails Metal (& Rack) a good way to implement a high traffic web service api?

- by Greg

I am working on a very typical web application. The main component of the user experience is a widget that a site owner would install on their front page. Every time their front page loads, the widget talks to our server and displays some of the data that returns. So there are two components to this web application: the front end UI that the site owner uses to configure their widget the back end component that responds to the widget's web api call Previously we had all of this running in PHP. Now we are experimenting with Rails, which is fantastic for #1 (the front end UI). The question is how to do #2, the back serving of widget information, efficiently. Obviously this is much higher load than the front end, since it is called every time the front page loads on one of our clients' websites. I can see two obvious approaches: A. Parallel Stack: Set up a parallel stack that uses something other than rails (e.g. our old PHP-based approach) but accesses the same database as the front end B. Rails Metal: Use Rails Metal/Rack to bypass the Rails routing mechanism, but keep the api call responder within the Rails app My main question: Is Rails/Metal a reasonable approach for something like this? But also... Will the overhead of loading the Rails environment still be too heavy? Is there a way to get even closer to the metal with Rails, bypassing most of the environment? Will Rails/Metal performance approach the perf of a similar task on straight PHP (just looking for ballpark here)? And... Is there a 'C' option that would be much better than both A and B? That is, something before going to the lengths of C code compiled to binary and installed as an nginx or apache module? Thanks in advance for any insights.

Read the article
solve a classic map-reduce problem with opencl?

- by liuliu

I am trying to parallel a classic map-reduce problem (which can parallel well with MPI) with OpenCL, namely, the AMD implementation. But the result bothers me. Let me brief about the problem first. There are two type of data that flow into the system: the feature set (30 parameters for each) and the sample set (9000+ dimensions for each). It is a classic map-reduce problem in the sense that I need to calculate the score of every feature on every sample (Map). And then, sum up the overall score for every feature (Reduce). There are around 10k features and 30k samples. I tried different ways to solve the problem. First, I tried to decompose the problem by features. The problem is that the score calculation consists of random memory access (pick some of the 9000+ dimensions and do plus/subtraction calculations). Since I cannot coalesce memory access, it costs. Then, I tried to decompose the problem by samples. The problem is that to sum up overall score, all threads are competing for few score variables. It keeps overwriting the score which turns out to be incorrect. (I cannot carry out individual score first and sum up later because it requires 10k * 30k * 4 bytes). The first method I tried gives me the same performance on i7 860 CPU with 8 threads. However, I don't think the problem is unsolvable: it is remarkably similar to ray tracing problem (for which you carry out calculation that millions of rays against millions of triangles). Any ideas?

Read the article
animation extender in datalist control in asp.net 2008

- by BibiBuBu

Good Day! i have a question that how can i use animation extender in datalist control in asp.net with c#. i want the animation when i click the delete button (delete button will be in repeater). so that when i remove one record then it shows animation to bring the next record. it is in update panel. <cc1:AnimationExtender ID="AnimationExtender1" runat="server" Enabled="True" TargetControlID="btnDeleteId"> <Animations> <OnClick> <Sequence> <EnableAction Enabled="false" /> <Parallel Duration=".2"> <Resize Height="0" Width="0" Unit="px" /> <FadeOut /> </Parallel> <HideAction /> </Sequence> </OnClick> </Animations> </cc1:AnimationExtender> now if i put my button id in the Target control id then it gives error that it should not be in same update panel etc... but over all nothing working for animation. i am binding my datalist in itemDataBound....e.g. ImageButton imgbtn = (ImageButton)e.Item.FindControl("imgBtnPic"); Label lblAvatar = (Label)e.Item.FindControl("lblAvatar"); LinkButton lbName = (LinkButton)e.Item.FindControl("lbtnName"); Can somebody please suggest me something. thanks

Read the article
Spatial Rotation in Gmod Expression2.

- by Fascia

I'm using expression2 to program behavior in Garry's mod (http://wiki.garrysmod.com/?title=Wire_Expression2) Okay so, to set the precedent. In Gmod I have a block and I am at a complete loss of how to get it to rotate around the 3 up, down and right vectors (Which are local. ie; if I pitch it 45 degrees the forward vector is 0.707, 0.707, 0). Essentially, From the 3 vectors I'd like to be able to get local Pitch/Roll/Yaw. By Local Pitch Roll Yaw I mean that they are completely independent of one another allowing true 3d rotation. So for example; if I place my craft so its nose is parallel to the floor the X,Y,Z would be 0,0,0. If I turn it parallel to the floor (World and Local Yaw) 90 degrees it's now 0, 0, 90. If I then pitch it (World Roll, Local Pitch) it 180 degrees it's now 180, 0, 90. I've already explored quaternions however I don't believe I should post my code here as I think I was re-inventing the wheel. I know I didn't explain that well but I believe the problem is pretty generic. Any help anyone could offer is greatly appreciated. Oh, I'd like to avoid gimblelock too. Essentially calculating the rotation around each of the crafts up/forward/right vectors using the up/forward/right vectors. To simply the question a generic implementation rather than one specific to Gmod is absolutely fine.

Read the article
Getting browser to make an AJAX call ASAP, while page is still loading

- by Chris

I'm looking for tips on how to get the browser to kick off an AJAX call as soon as possible into processing a web page, ideally before the page is fully downloaded. Here's my approximate motivation: I have a web page that takes, say, 5 seconds to load. It calls a web service that takes, say, 10 seconds to load. If loading the page and calling the web service happened sequentially, the user would have to wait 15 seconds to see all the information. However, if I can get the web service call started before the 5 second page load is complete, then at least some of the work can happened in parallel. Ideally I'd like as much of the work to happen in parallel as possible. My initial theory was that I should place the AJAX-calling javascript as high up as possible in the web page HTML source (being mindful of putting it after the jquery.js include, because I'm making the call using jquery ajax). I'm also being mindful not to wrap the AJAX call in a jquery ready event handler. (I mention this because ready events are popular in a lot of jquery example code.) However, the AJAX call still doesn't seem to get kicked off as early as I'm hoping (at least as judged by the Google Chrome "Timeline" feature), so I'm wondering what other considerations apply. One thing that might potentially be detrimental is that the AJAX call is back to the same web server that's serving the original web page, so I might be in danger of hitting a browser limit on the # of HTTP connections back to that one machine? (The HTML page loads a number of images, css files, etc..)

Read the article
P6 Architecture - Register renaming aside, does the limited user registers result in more ops spent

- by mrjoltcola

I'm studying JIT design with regard to dynamic languages VM implementation. I haven't done much Assembly since the 8086/8088 days, just a little here or there, so be nice if I'm out of sorts. As I understand it, the x86 (IA-32) architecture still has the same basic limited register set today that it always did, but the internal register count has grown tremendously, but these internal registers are not generally available and are used with register renaming to achieve parallel pipelining of code that otherwise could not be parallelizable. I understand this optimization pretty well, but my feeling is, while these optimizations help in overall throughput and for parallel algorithms, the limited register set we are still stuck with results in more register spilling overhead such that if x86 had double, or quadruple the registers available to us, there may be significantly less push/pop opcodes in a typical instruction stream? Or are there other processor optmizations that also optimize this away that I am unaware of? Basically if I've a unit of code that has 4 registers to work with for integer work, but my unit has a dozen variables, I've got potentially a push/pop for every 2 or so instructions. Any references to studies, or better yet, personal experiences?

Read the article
How can I improve my real-time behavior in multi-threaded app using pthreads and condition variables

- by WilliamKF

I have a multi-threaded application that is using pthreads. I have a mutex() lock and condition variables(). There are two threads, one thread is producing data for the second thread, a worker, which is trying to process the produced data in a real time fashion such that one chuck is processed as close to the elapsing of a fixed time period as possible. This works pretty well, however, occasionally when the producer thread releases the condition upon which the worker is waiting, a delay of up to almost a whole second is seen before the worker thread gets control and executes again. I know this because right before the producer releases the condition upon which the worker is waiting, it does a chuck of processing for the worker if it is time to process another chuck, then immediately upon receiving the condition in the worker thread, it also does a chuck of processing if it is time to process another chuck. In this later case, I am seeing that I am late processing the chuck many times. I'd like to eliminate this lost efficiency and do what I can to keep the chucks ticking away as close to possible to the desired frequency. Is there anything I can do to reduce the delay between the release condition from the producer and the detection that that condition is released such that the worker resumes processing? For example, would it help for the producer to call something to force itself to be context switched out? Bottom line is the worker has to wait each time it asks the producer to create work for itself so that the producer can muck with the worker's data structures before telling the worker it is ready to run in parallel again. This period of exclusive access by the producer is meant to be short, but during this period, I am also checking for real-time work to be done by the producer on behalf of the worker while the producer has exclusive access. Somehow my hand off back to running in parallel again results in significant delay occasionally that I would like to avoid. Please suggest how this might be best accomplished.

Read the article
Where to put external archives to configure running in Eclipse?

- by Buggieboy

As a Java/Eclipse noob coming from a Visual Studio .NET background, I'm trying to understand how to set up my run/debug environment in the Eclipse IDE so I can start stepping through code. I have built my application with separate src and bin hierarchies for each library project, and each project has its own jar. For example, in a Windows environment, I have something like: c:\myapp\myapp_main\src\com\mycorp\myapp\main ...and a parallel "bin" tree like this: c:\myapp\myapp_main\bin\com\mycorp\myapp\main Other supporting projects are, similarly: **c:\myapp\myapp_util\src\com\mycorp\myapp\uti**l (and a parallel "bin" tree.) ... etc. So, I end up with, e.g., myapp_util.jar in the ...\myapp_util\bin... path and then add that as an external archive to my myapp_main project. I also use utilities like gluegen-rt.jar, which I add ad external dependencies to the projects requiring them. I have been able to run outside of the Eclipse environment, by copying all my project jars, gluegen-rt DLL, etc., into a "lib" subfolder of some directory and executing something like: java -Djava.library.path=lib -DfullScreen=false -cp lib/gluegen-rt.jar;lib/myapp_main.jar;lib/myapp_util.jar; com.mycorp.myapp.Main When I first pressed F11 to debug, however, I got a message about something like /com/sun/../glugen... not being found by the class loader. So, to debug, or even just run, in Ecplipse, I tried setting up my VM arguments in the Galileo Debug - (Run/Debug) Configurations to be the command line above, beginning at "-Djava.libary.path...". I've put a lib subdirectory - just like the above with all jars and the native gluegen DLL - in various places, such as beneath the folder that my main jar is built in and as a subfolder of my Ecplipse starting workspace folder, but now Eclipse can't find the main class: java.lang.NoClassDefFoundError: com.mycorp.myapp.Main Although the Classpath says that it is using the "default classpath", whatever that is. Bottom line, how do I assemble the constituent files of a multi-project application so that I can run or debug in Ecplipse?

Read the article
java class creation dynamically and make it accessible across the network different jvms i.e. serial

- by inj.rav

Hi. I have a requirement of creating java classes dynamically and make it accessible different jvms across the network. I tried to use reflection and javassist tool,but nothing worked. Let me explain the scenario we are using Coherence distributed cache. It has a power of doing aggregation/filtering in parallel across the cluster. For example if a class has [dynamic class] has amount variable and getAmount/setAmount methods. Then if we execute COHERENCE queries, it will start process in parallel across the cluster. I tried to create classes at run time by using javassist and reflection. I am able to access it from single JVM, but when I tried to access the same class from other jvm [through coherence cluster]. I am getting exception of class not found [as remote jvm is not having idea of this class].I can over come this by creating same class dynamically on remote jvm also and access the methods. But coherence in built methods/functions are not able to find the class. could some one help me on this matter

Read the article
MySql Query lag time / deadlock?

- by Click Upvote

When there are multiple PHP scripts running in parallel, each making an UPDATE query to the same record in the same table repeatedly, is it possible for there to be a 'lag time' before the table is updated with each query? I have basically 5-6 instances of a PHP script running in parallel, having been launched via cron. Each script gets all the records in the items table, and then loops through them and processes them. However, to avoid processing the same item more than once, I store the id of the last item being processed in a seperate table. So this is how my code works: function getCurrentItem() { $sql = "SELECT currentItemId from settings"; $result = $this->db->query($sql); return $result->get('currentItemId'); } function setCurrentItem($id) { $sql = "UPDATE settings SET currentItemId='$id'"; $this->db->query($sql); } $currentItem = $this->getCurrentItem(); $sql = "SELECT * FROM items WHERE status='pending' AND id > $currentItem'"; $result = $this->db->query($sql); $items = $result->getAll(); foreach ($items as $i) { //Check if $i has been processed by a different instance of the script, and if so, //leave it untouched. if ($this->getCurrentItem() > $i->id) continue; $this->setCurrentItem($i->id); // Process the item here } But despite of all the precautions, most items are being processed more than once. Which makes me think that there is some lag time between the update queries being run by the PHP script, and when the database actually updates the record. Is it true? And if so, what other mechanism should I use to ensure that the PHP scripts always get only the latest currentItemId even when there are multiple scripts running in parrallel? Would using a text file instead of the db help?

Read the article
Quickest algorithm for finding sets with high intersection

- by conradlee

I have a large number of user IDs (integers), potentially millions. These users all belong to various groups (sets of integers), such that there are on the order of 10 million groups. To simplify my example and get to the essence of it, let's assume that all groups contain 20 user IDs (i.e., all integer sets have a cardinality of 100). I want to find all pairs of integer sets that have an intersection of 15 or greater. Should I compare every pair of sets? (If I keep a data structure that maps userIDs to set membership, this would not be necessary.) What is the quickest way to do this? That is, what should my underlying data structure be for representing the integer sets? Sorted sets, unsorted---can hashing somehow help? And what algorithm should I use to compute set intersection)? I prefer answers that relate C/C++ (especially STL), but also any more general, algorithmic insights are welcome. Update Also, note that I will be running this in parallel in a shared memory environment, so ideas that cleanly extend to a parallel solution are preferred.

Read the article
What's the compelling reason to upgrade to Visual Studio 2010 from VS2008?

- by Cheeso

Are there new features in Visual Studio 2010 that are must-haves? If so, which ones? For me, the big draws for VS2008 as compared to VS2005 were LINQ, .NET Framework multitargeting, WCF (REST + Syndication), and general devenv.exe reliability. Granted, some of these features are framework things, and not tool things. For the purposes of this discussion, I'm willing to combine them into one bucket. What is the list of must-have features for VS2010 versus VS2008? Are there any? I am particularly interested in C#. Update: I know how to google, so I can get the official list from Microsoft. I guess what I really wanted was, the assessment from people using it, as to which things are really notable. Microsoft went on for 3 pages about 2008/3.5 features, and many people sort of boiled it down to LINQ, and a few other things. What is that short list for VS2010? Summary so far, what people think is cool or compelling: Visual Studio engine multi-monitor support new extensibility model based on WPF, prettier and more usable new TFS stuff, incl automated test tools parallel debugging .NET Framework parallel extensions for .NET C# 4.0 generic variance optional and named params easier interop with non-managed environments, like COM or Javascript VB 10.0 collection and array literals / initializers automatic properties anonymous methods / statement lambdas I read up on these at Zander's blog. He described these and other features. Nobody on this list said anything about: Visual Studio engine F# support Javascript code-completion JQuery is now included UML better Sharepoint capabilities C++ moves to msbuild project files

Read the article
InvalidOperationException when executing SqlCommand with transaction

- by Serhat Özgel

I have this code, running parallel in two separate threads. It works fine for a few times, but at some random point it throws InvalidOperationException: The transaction is either not associated with the current connection or has been completed. At the point of exception, I am looking inside the transaction with visual studio and verify its connection is set normally. Also command.Transaction._internalTransaction. _transactionState is set to Active and IsZombied property is set to false. This is a test application and I am using Thread.Sleep for creating longer transactions and causing overlaps. Why may the exception being thrown and what can I do about it? IDbCommand command = new SqlCommand("Select * From INFO"); IDbConnection connection = new SqlConnection(connectionString); command.Connection = connection; IDbTransaction transaction = null; try { connection.Open(); transaction = connection.BeginTransaction(); command.Transaction = transaction; command.ExecuteNonQuery(); // Sometimes throws exception Thread.Sleep(forawhile); // For overlapping transactions running in parallel transaction.Commit(); } catch (ApplicationException exception) { if (transaction != null) { transaction.Rollback(); } } finally { connection.Close(); }

Read the article
MySql Query lag time?

- by Click Upvote

When there are multiple PHP scripts running in parallel, each making an UPDATE query to the same record in the same table repeatedly, is it possible for there to be a 'lag time' before the table is updated with each query? I have basically 5-6 instances of a PHP script running in parallel, having been launched via cron. Each script gets all the records in the items table, and then loops through them and processes them. However, to avoid processing the same item more than once, I store the id of the last item being processed in a seperate table. So this is how my code works: function getCurrentItem() { $sql = "SELECT currentItemId from settings"; $result = $this->db->query($sql); return $result->get('currentItemId'); } function setCurrentItem($id) { $sql = "UPDATE settings SET currentItemId='$id'"; $this->db->query($sql); } $currentItem = $this->getCurrentItem(); $sql = "SELECT * FROM items WHERE status='pending' AND id > $currentItem'"; $result = $this->db->query($sql); $items = $result->getAll(); foreach ($items as $i) { //Check if $i has been processed by a different instance of the script, and if so, //leave it untouched. if ($this->getCurrentItem() > $i->id) continue; $this->setCurrentItem($i->id); // Process the item here } But despite of all the precautions, most items are being processed more than once. Which makes me think that there is some lag time between the update queries being run by the PHP script, and when the database actually updates the record. Is it true? And if so, what other mechanism should I use to ensure that the PHP scripts always get only the latest currentItemId even when there are multiple scripts running in parrallel? Would using a text file instead of the db help?

Read the article
How to append to a log file in powershell?

- by Mark Allison

Hi there, I am doing some parallel SQL Server 2005 database restores in powershell. The way I have done it is to use cmd.exe and start so that powershell doesn't wait for it to complete. What I need to do is to pipe the output into a log file with append. If I use Add-Content, then powershell waits, which is not what I want. My code snippet is foreach ($line in $database_list) { <snip> # Create logins sqlcmd.exe -S $instance -E -d master -i $loginsFile -o $logFile # Read commands from a temp file and execute them in parallel with sqlcmd.exe cmd.exe /c start "Restoring $database" /D"$pwd" sqlcmd.exe -S $instance -E -d master -i $tempSQLFile -t 0 -o $logFile [void]$logFiles.Add($logFile) } The problem is that sqlcmd.exe -o overwrites. I've tried doing this to append: cmd.exe /c start "Restoring $database" /D"$pwd" sqlcmd.exe -S $instance -E -d master -i $tempSQLFile -t 0 >> $logFile But it doesn't work because the output stays in the SQLCMD window and doesn't go to the file. Any suggestions? Thanks, Mark.

Read the article
How can I determine PerlLogHandler performance impact?

- by Timmy

I want to create a custom Apache2 log handler, and the template that is found on the apache site is: #file:MyApache2/LogPerUser.pm #--------------------------- package MyApache2::LogPerUser; use strict; use warnings; use Apache2::RequestRec (); use Apache2::Connection (); use Fcntl qw(:flock); use File::Spec::Functions qw(catfile); use Apache2::Const -compile => qw(OK DECLINED); sub handler { my $r = shift; my ($username) = $r->uri =~ m|^/~([^/]+)|; return Apache2::Const::DECLINED unless defined $username; my $entry = sprintf qq(%s [%s] "%s" %d %d\n), $r->connection->remote_ip, scalar(localtime), $r->uri, $r->status, $r->bytes_sent; my $log_path = catfile Apache2::ServerUtil::server_root, "logs", "$username.log"; open my $fh, ">>$log_path" or die "can't open $log_path: $!"; flock $fh, LOCK_EX; print $fh $entry; close $fh; return Apache2::Const::OK; } 1; What is the performance cost of the flocks? Is this logging process done in parallel, or in serial with the HTTP request? In parallel the performance would not matter as much, but I wouldn't want the user to wait another split second to add something like this.

Read the article
Time with and without OpenMP

- by was

I have a question.. I tried to improve a well known program algorithm in C, FOX algorithm for matrix multiplication.. relative link without openMP: (http://web.mst.edu/~ercal/387/MPI/ppmpi_c/chap07/fox.c). The initial program had only MPI and I tried to insert openMP in the matrix multiplication method, in order to improve the time of computation: (This program runs in a cluster and computers have 2 cores, thus I created 2 threads.) The problem is that there is no difference of time, with and without openMP. I observed that using openMP sometimes, time is equivalent or greater than the time without openMP. I tried to multiply two 600x600 matrices. void Local_matrix_multiply( LOCAL_MATRIX_T* local_A /* in */, LOCAL_MATRIX_T* local_B /* in */, LOCAL_MATRIX_T* local_C /* out */) { int i, j, k; chunk = CHUNKSIZE; // 100 #pragma omp parallel shared(local_A, local_B, local_C, chunk, nthreads) private(i,j,k,tid) num_threads(2) { /* tid = omp_get_thread_num(); if(tid == 0){ nthreads = omp_get_num_threads(); printf("O Pollaplasiamos pinakwn ksekina me %d threads\n", nthreads); } printf("Thread %d use the matrix: \n", tid); */ #pragma omp for schedule(static, chunk) for (i = 0; i < Order(local_A); i++) for (j = 0; j < Order(local_A); j++) for (k = 0; k < Order(local_B); k++) Entry(local_C,i,j) = Entry(local_C,i,j) + Entry(local_A,i,k)*Entry(local_B,k,j); } //end pragma omp parallel } /* Local_matrix_multiply */

Read the article

Search Results

Search found 1774 results on 71 pages for 'parallel for'.

Page 31/71 | < Previous Page | 27 28 29 30 31 32 33 34 35 36 37 38 | Next Page >

- by David Allan

- by CliveT

- by Daniel Moth

- by Herve Roggero

- by Daniel Moth

- by Tony Davis

- by nospam(at)example.com (Joerg Moellenkamp)

- by Steve Sistare

- by Greg

- by liuliu

- by BibiBuBu

- by Fascia

- by Chris

- by mrjoltcola

- by WilliamKF

- by Buggieboy

- by inj.rav

- by Click Upvote

- by conradlee

- by Cheeso

- by Serhat Özgel

- by Click Upvote

- by Mark Allison

- by Timmy

- by was

< Previous Page | 27 28 29 30 31 32 33 34 35 36 37 38 | Next Page >