Search Results

Search found 28685 results on 1148 pages for 'query performance'.

Page 15/1148 | < Previous Page | 11 12 13 14 15 16 17 18 19 20 21 22  | Next Page >

  • How does a website latency simulator work

    - by nighthawk457
    Sites like webpagetest allow users to enter a website url and a test location, to run a speed test on the site from multiple locations using real browsers. Can anyone give me a basic idea of how sites like this work? You also have plugin's like Aptimize latency simulator or charles web debugging proxy app, that simulate the delay while accessing a site from different locations. I am assuming since these are plugin's these function in a different way. How do these plugin's work ?

    Read the article

  • T-SQL Tuesday #13: Clarifying Requirements

    - by Alexander Kuznetsov
    When we transform initial ideas into clear requirements for databases, we typically have to make the following choices: Frequent maintenance vs doing it once. As we are clarifying the requirements, we need to determine whether we want to concinue spending considerable time maintaining the system, or if we want to finish it up and move on to other tasks. Race car maintenance vs installing electric wiring is my favorite analogy for this kind of choice. In some cases we need to sqeeze every last bit...(read more)

    Read the article

  • SQLRally Nordic and SQLRally Amsterdam: Wrap Up and Demos

    - by Adam Machanic
    First and foremost : Huge thanks, and huge apologies, to everyone who attended my sessions at these events. I promised to post materials last week, and there is no good excuse for tardiness. My dog did not eat my computer. I don't have a dog. And if I did, she would far prefer a nice rib eye to a hard chunk of plastic. Now, on to the purpose of this post... Last week I was lucky enough to have a first visit to each of two amazing cities, Stockholm and Amsterdam. Both cities, as mentioned previously...(read more)

    Read the article

  • SQLAuthority News – Great Time Spent at Great Indian Developers Summit 2014

    - by Pinal Dave
    The Great Indian Developer Conference (GIDS) is one of the most popular annual event held in Bangalore. This year GIDS is scheduled on April 22, 25. I will be presented total four sessions at this event and each session is very different from each other. Here are the details of four of my sessions, which I presented there. Pluralsight Shades This event was a great event and I had fantastic fun presenting a technology over here. I was indeed very excited that along with me, I had many of my friends presenting at the event as well. I want to thank all of you to attend my session and having standing room every single time. I have already sent resources in my newsletter. You can sign up for the newsletter over here. Indexing is an Art I was amazed with the crowd present in the sessions at GIDS. There was a great interest in the subject of SQL Server and Performance Tuning. Audience at GIDS I believe event like such provides a great platform to meet and share knowledge. Pinal at Pluralsight Booth Here are the abstract of the sessions which I had presented. They were recorded so at some point in time they will be available, but if you want the content of all the courses immediately, I suggest you check out my video courses on the same subject on Pluralsight. Indexes, the Unsung Hero Relevant Pluralsight Course Slow Running Queries are the most common problem that developers face while working with SQL Server. While it is easy to blame SQL Server for unsatisfactory performance, the issue often persists with the way queries have been written, and how Indexes has been set up. The session will focus on the ways of identifying problems that slow down SQL Server, and Indexing tricks to fix them. Developers will walk out with scripts and knowledge that can be applied to their servers, immediately post the session. Indexes are the most crucial objects of the database. They are the first stop for any DBA and Developer when it is about performance tuning. There is a good side as well evil side to indexes. To master the art of performance tuning one has to understand the fundamentals of indexes and the best practices associated with the same. We will cover various aspects of Indexing such as Duplicate Index, Redundant Index, Missing Index as well as best practices around Indexes. SQL Server Performance Troubleshooting: Ancient Problems and Modern Solutions Relevant Pluralsight Course Many believe Performance Tuning and Troubleshooting is an art which has been lost in time. However, truth is that art has evolved with time and there are more tools and techniques to overcome ancient troublesome scenarios. There are three major resources that when bottlenecked creates performance problems: CPU, IO, and Memory. In this session we will focus on High CPU scenarios detection and their resolutions. If time permits we will cover other performance related tips and tricks. At the end of this session, attendees will have a clear idea as well as action items regarding what to do when facing any of the above resource intensive scenarios. Developers will walk out with scripts and knowledge that can be applied to their servers, immediately post the session. To master the art of performance tuning one has to understand the fundamentals of performance, tuning and the best practices associated with the same. We will discuss about performance tuning in this session with the help of Demos. Pinal Dave at GIDS MySQL Performance Tuning – Unexplored Territory Relevant Pluralsight Course Performance is one of the most essential aspects of any application. Everyone wants their server to perform optimally and at the best efficiency. However, not many people talk about MySQL and Performance Tuning as it is an extremely unexplored territory. In this session, we will talk about how we can tune MySQL Performance. We will also try and cover other performance related tips and tricks. At the end of this session, attendees will not only have a clear idea, but also carry home action items regarding what to do when facing any of the above resource intensive scenarios. Developers will walk out with scripts and knowledge that can be applied to their servers, immediately post the session. To master the art of performance tuning one has to understand the fundamentals of performance, tuning and the best practices associated with the same. You will also witness some impressive performance tuning demos in this session. Hidden Secrets and Gems of SQL Server We Bet You Never Knew Relevant Pluralsight Course SQL Trio Session! It really amazes us every time when someone says SQL Server is an easy tool to handle and work with. Microsoft has done an amazing work in making working with complex relational database a breeze for developers and administrators alike. Though it looks like child’s play for some, the realities are far away from this notion. The basics and fundamentals though are simple and uniform across databases, the behavior and understanding the nuts and bolts of SQL Server is something we need to master over a period of time. With a collective experience of more than 30+ years amongst the speakers on databases, we will try to take a unique tour of various aspects of SQL Server and bring to you life lessons learnt from working with SQL Server. We will share some of the trade secrets of performance, configuration, new features, tuning, behaviors, T-SQL practices, common pitfalls, productivity tips on tools and more. This is a highly demo filled session for practical use if you are a SQL Server developer or an Administrator. The speakers will be able to stump you and give you answers on almost everything inside the Relational database called SQL Server. I personally attended the session of Vinod Kumar, Balmukund Lakhani, Abhishek Kumar and my favorite Govind Kanshi. Summary If you have missed this event here are two action items 1) Sign up for Resource Newsletter 2) Watch my video courses on Pluralsight Reference: Pinal Dave (http://blog.sqlauthority.com)Filed under: MySQL, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, SQLAuthority Author Visit, SQLAuthority News, T SQL Tagged: GIDS

    Read the article

  • SQL Server Training in the UK–SSIS, MDX, Admin, MDS, Internals

    - by simonsabin
    If you are looking for SQL Server training they there is no better place to start than a new company Technitrain Its been setup by a fellow MVP and SQLBits Organiser Chris Webb. Why this company rather than any others? Training based on real world experience by the best in the business. The key to Technitrain’s model is not to cram the shelves high with courses and get some average Joe trainers to deliver them. Technitrain bring in world renowned experts in their fields to deliver courses written...(read more)

    Read the article

  • How to test the render speed of my solution in a web browser?

    - by Cuartico
    Ok, I need to test the speed of my solution in a web browser, but I have some problems, there are 2 versions of the web solution, the original one that is on server A and the "fixed" version that is on server B. I have VS2010 Ultimate, so I can make a web and load test on solution B, but I can't load the A solution on my IDE. I was trying to use fiddle2 and jmeter, but they only gave me the times of the request and response of the browsers with the server, I also want the time it takes to the browser to render the whole page. Maybe I'm misusing some of this tools... I don't know if this could be usefull but: Solution A is on VB 6.0 Solution B is on VB.Net Thanks in advance!

    Read the article

  • Improving Partitioned Table Join Performance

    - by Paul White
    The query optimizer does not always choose an optimal strategy when joining partitioned tables. This post looks at an example, showing how a manual rewrite of the query can almost double performance, while reducing the memory grant to almost nothing. Test Data The two tables in this example use a common partitioning partition scheme. The partition function uses 41 equal-size partitions: CREATE PARTITION FUNCTION PFT (integer) AS RANGE RIGHT FOR VALUES ( 125000, 250000, 375000, 500000, 625000, 750000, 875000, 1000000, 1125000, 1250000, 1375000, 1500000, 1625000, 1750000, 1875000, 2000000, 2125000, 2250000, 2375000, 2500000, 2625000, 2750000, 2875000, 3000000, 3125000, 3250000, 3375000, 3500000, 3625000, 3750000, 3875000, 4000000, 4125000, 4250000, 4375000, 4500000, 4625000, 4750000, 4875000, 5000000 ); GO CREATE PARTITION SCHEME PST AS PARTITION PFT ALL TO ([PRIMARY]); There two tables are: CREATE TABLE dbo.T1 ( TID integer NOT NULL IDENTITY(0,1), Column1 integer NOT NULL, Padding binary(100) NOT NULL DEFAULT 0x,   CONSTRAINT PK_T1 PRIMARY KEY CLUSTERED (TID) ON PST (TID) );   CREATE TABLE dbo.T2 ( TID integer NOT NULL, Column1 integer NOT NULL, Padding binary(100) NOT NULL DEFAULT 0x,   CONSTRAINT PK_T2 PRIMARY KEY CLUSTERED (TID, Column1) ON PST (TID) ); The next script loads 5 million rows into T1 with a pseudo-random value between 1 and 5 for Column1. The table is partitioned on the IDENTITY column TID: INSERT dbo.T1 WITH (TABLOCKX) (Column1) SELECT (ABS(CHECKSUM(NEWID())) % 5) + 1 FROM dbo.Numbers AS N WHERE n BETWEEN 1 AND 5000000; In case you don’t already have an auxiliary table of numbers lying around, here’s a script to create one with 10 million rows: CREATE TABLE dbo.Numbers (n bigint PRIMARY KEY);   WITH L0 AS(SELECT 1 AS c UNION ALL SELECT 1), L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B), L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B), L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B), L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B), L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B), Nums AS(SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM L5) INSERT dbo.Numbers WITH (TABLOCKX) SELECT TOP (10000000) n FROM Nums ORDER BY n OPTION (MAXDOP 1); Table T1 contains data like this: Next we load data into table T2. The relationship between the two tables is that table 2 contains ‘n’ rows for each row in table 1, where ‘n’ is determined by the value in Column1 of table T1. There is nothing particularly special about the data or distribution, by the way. INSERT dbo.T2 WITH (TABLOCKX) (TID, Column1) SELECT T.TID, N.n FROM dbo.T1 AS T JOIN dbo.Numbers AS N ON N.n >= 1 AND N.n <= T.Column1; Table T2 ends up containing about 15 million rows: The primary key for table T2 is a combination of TID and Column1. The data is partitioned according to the value in column TID alone. Partition Distribution The following query shows the number of rows in each partition of table T1: SELECT PartitionID = CA1.P, NumRows = COUNT_BIG(*) FROM dbo.T1 AS T CROSS APPLY (VALUES ($PARTITION.PFT(TID))) AS CA1 (P) GROUP BY CA1.P ORDER BY CA1.P; There are 40 partitions containing 125,000 rows (40 * 125k = 5m rows). The rightmost partition remains empty. The next query shows the distribution for table 2: SELECT PartitionID = CA1.P, NumRows = COUNT_BIG(*) FROM dbo.T2 AS T CROSS APPLY (VALUES ($PARTITION.PFT(TID))) AS CA1 (P) GROUP BY CA1.P ORDER BY CA1.P; There are roughly 375,000 rows in each partition (the rightmost partition is also empty): Ok, that’s the test data done. Test Query and Execution Plan The task is to count the rows resulting from joining tables 1 and 2 on the TID column: SET STATISTICS IO ON; DECLARE @s datetime2 = SYSUTCDATETIME();   SELECT COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID;   SELECT DATEDIFF(Millisecond, @s, SYSUTCDATETIME()); SET STATISTICS IO OFF; The optimizer chooses a plan using parallel hash join, and partial aggregation: The Plan Explorer plan tree view shows accurate cardinality estimates and an even distribution of rows across threads (click to enlarge the image): With a warm data cache, the STATISTICS IO output shows that no physical I/O was needed, and all 41 partitions were touched: Running the query without actual execution plan or STATISTICS IO information for maximum performance, the query returns in around 2600ms. Execution Plan Analysis The first step toward improving on the execution plan produced by the query optimizer is to understand how it works, at least in outline. The two parallel Clustered Index Scans use multiple threads to read rows from tables T1 and T2. Parallel scan uses a demand-based scheme where threads are given page(s) to scan from the table as needed. This arrangement has certain important advantages, but does result in an unpredictable distribution of rows amongst threads. The point is that multiple threads cooperate to scan the whole table, but it is impossible to predict which rows end up on which threads. For correct results from the parallel hash join, the execution plan has to ensure that rows from T1 and T2 that might join are processed on the same thread. For example, if a row from T1 with join key value ‘1234’ is placed in thread 5’s hash table, the execution plan must guarantee that any rows from T2 that also have join key value ‘1234’ probe thread 5’s hash table for matches. The way this guarantee is enforced in this parallel hash join plan is by repartitioning rows to threads after each parallel scan. The two repartitioning exchanges route rows to threads using a hash function over the hash join keys. The two repartitioning exchanges use the same hash function so rows from T1 and T2 with the same join key must end up on the same hash join thread. Expensive Exchanges This business of repartitioning rows between threads can be very expensive, especially if a large number of rows is involved. The execution plan selected by the optimizer moves 5 million rows through one repartitioning exchange and around 15 million across the other. As a first step toward removing these exchanges, consider the execution plan selected by the optimizer if we join just one partition from each table, disallowing parallelism: SELECT COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID WHERE $PARTITION.PFT(T1.TID) = 1 AND $PARTITION.PFT(T2.TID) = 1 OPTION (MAXDOP 1); The optimizer has chosen a (one-to-many) merge join instead of a hash join. The single-partition query completes in around 100ms. If everything scaled linearly, we would expect that extending this strategy to all 40 populated partitions would result in an execution time around 4000ms. Using parallelism could reduce that further, perhaps to be competitive with the parallel hash join chosen by the optimizer. This raises a question. If the most efficient way to join one partition from each of the tables is to use a merge join, why does the optimizer not choose a merge join for the full query? Forcing a Merge Join Let’s force the optimizer to use a merge join on the test query using a hint: SELECT COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID OPTION (MERGE JOIN); This is the execution plan selected by the optimizer: This plan results in the same number of logical reads reported previously, but instead of 2600ms the query takes 5000ms. The natural explanation for this drop in performance is that the merge join plan is only using a single thread, whereas the parallel hash join plan could use multiple threads. Parallel Merge Join We can get a parallel merge join plan using the same query hint as before, and adding trace flag 8649: SELECT COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID OPTION (MERGE JOIN, QUERYTRACEON 8649); The execution plan is: This looks promising. It uses a similar strategy to distribute work across threads as seen for the parallel hash join. In practice though, performance is disappointing. On a typical run, the parallel merge plan runs for around 8400ms; slower than the single-threaded merge join plan (5000ms) and much worse than the 2600ms for the parallel hash join. We seem to be going backwards! The logical reads for the parallel merge are still exactly the same as before, with no physical IOs. The cardinality estimates and thread distribution are also still very good (click to enlarge): A big clue to the reason for the poor performance is shown in the wait statistics (captured by Plan Explorer Pro): CXPACKET waits require careful interpretation, and are most often benign, but in this case excessive waiting occurs at the repartitioning exchanges. Unlike the parallel hash join, the repartitioning exchanges in this plan are order-preserving ‘merging’ exchanges (because merge join requires ordered inputs): Parallelism works best when threads can just grab any available unit of work and get on with processing it. Preserving order introduces inter-thread dependencies that can easily lead to significant waits occurring. In extreme cases, these dependencies can result in an intra-query deadlock, though the details of that will have to wait for another time to explore in detail. The potential for waits and deadlocks leads the query optimizer to cost parallel merge join relatively highly, especially as the degree of parallelism (DOP) increases. This high costing resulted in the optimizer choosing a serial merge join rather than parallel in this case. The test results certainly confirm its reasoning. Collocated Joins In SQL Server 2008 and later, the optimizer has another available strategy when joining tables that share a common partition scheme. This strategy is a collocated join, also known as as a per-partition join. It can be applied in both serial and parallel execution plans, though it is limited to 2-way joins in the current optimizer. Whether the optimizer chooses a collocated join or not depends on cost estimation. The primary benefits of a collocated join are that it eliminates an exchange and requires less memory, as we will see next. Costing and Plan Selection The query optimizer did consider a collocated join for our original query, but it was rejected on cost grounds. The parallel hash join with repartitioning exchanges appeared to be a cheaper option. There is no query hint to force a collocated join, so we have to mess with the costing framework to produce one for our test query. Pretending that IOs cost 50 times more than usual is enough to convince the optimizer to use collocated join with our test query: -- Pretend IOs are 50x cost temporarily DBCC SETIOWEIGHT(50);   -- Co-located hash join SELECT COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID OPTION (RECOMPILE);   -- Reset IO costing DBCC SETIOWEIGHT(1); Collocated Join Plan The estimated execution plan for the collocated join is: The Constant Scan contains one row for each partition of the shared partitioning scheme, from 1 to 41. The hash repartitioning exchanges seen previously are replaced by a single Distribute Streams exchange using Demand partitioning. Demand partitioning means that the next partition id is given to the next parallel thread that asks for one. My test machine has eight logical processors, and all are available for SQL Server to use. As a result, there are eight threads in the single parallel branch in this plan, each processing one partition from each table at a time. Once a thread finishes processing a partition, it grabs a new partition number from the Distribute Streams exchange…and so on until all partitions have been processed. It is important to understand that the parallel scans in this plan are different from the parallel hash join plan. Although the scans have the same parallelism icon, tables T1 and T2 are not being co-operatively scanned by multiple threads in the same way. Each thread reads a single partition of T1 and performs a hash match join with the same partition from table T2. The properties of the two Clustered Index Scans show a Seek Predicate (unusual for a scan!) limiting the rows to a single partition: The crucial point is that the join between T1 and T2 is on TID, and TID is the partitioning column for both tables. A thread that processes partition ‘n’ is guaranteed to see all rows that can possibly join on TID for that partition. In addition, no other thread will see rows from that partition, so this removes the need for repartitioning exchanges. CPU and Memory Efficiency Improvements The collocated join has removed two expensive repartitioning exchanges and added a single exchange processing 41 rows (one for each partition id). Remember, the parallel hash join plan exchanges had to process 5 million and 15 million rows. The amount of processor time spent on exchanges will be much lower in the collocated join plan. In addition, the collocated join plan has a maximum of 8 threads processing single partitions at any one time. The 41 partitions will all be processed eventually, but a new partition is not started until a thread asks for it. Threads can reuse hash table memory for the new partition. The parallel hash join plan also had 8 hash tables, but with all 5,000,000 build rows loaded at the same time. The collocated plan needs memory for only 8 * 125,000 = 1,000,000 rows at any one time. Collocated Hash Join Performance The collated join plan has disappointing performance in this case. The query runs for around 25,300ms despite the same IO statistics as usual. This is much the worst result so far, so what went wrong? It turns out that cardinality estimation for the single partition scans of table T1 is slightly low. The properties of the Clustered Index Scan of T1 (graphic immediately above) show the estimation was for 121,951 rows. This is a small shortfall compared with the 125,000 rows actually encountered, but it was enough to cause the hash join to spill to physical tempdb: A level 1 spill doesn’t sound too bad, until you realize that the spill to tempdb probably occurs for each of the 41 partitions. As a side note, the cardinality estimation error is a little surprising because the system tables accurately show there are 125,000 rows in every partition of T1. Unfortunately, the optimizer uses regular column and index statistics to derive cardinality estimates here rather than system table information (e.g. sys.partitions). Collocated Merge Join We will never know how well the collocated parallel hash join plan might have worked without the cardinality estimation error (and the resulting 41 spills to tempdb) but we do know: Merge join does not require a memory grant; and Merge join was the optimizer’s preferred join option for a single partition join Putting this all together, what we would really like to see is the same collocated join strategy, but using merge join instead of hash join. Unfortunately, the current query optimizer cannot produce a collocated merge join; it only knows how to do collocated hash join. So where does this leave us? CROSS APPLY sys.partitions We can try to write our own collocated join query. We can use sys.partitions to find the partition numbers, and CROSS APPLY to get a count per partition, with a final step to sum the partial counts. The following query implements this idea: SELECT row_count = SUM(Subtotals.cnt) FROM ( -- Partition numbers SELECT p.partition_number FROM sys.partitions AS p WHERE p.[object_id] = OBJECT_ID(N'T1', N'U') AND p.index_id = 1 ) AS P CROSS APPLY ( -- Count per collocated join SELECT cnt = COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID WHERE $PARTITION.PFT(T1.TID) = p.partition_number AND $PARTITION.PFT(T2.TID) = p.partition_number ) AS SubTotals; The estimated plan is: The cardinality estimates aren’t all that good here, especially the estimate for the scan of the system table underlying the sys.partitions view. Nevertheless, the plan shape is heading toward where we would like to be. Each partition number from the system table results in a per-partition scan of T1 and T2, a one-to-many Merge Join, and a Stream Aggregate to compute the partial counts. The final Stream Aggregate just sums the partial counts. Execution time for this query is around 3,500ms, with the same IO statistics as always. This compares favourably with 5,000ms for the serial plan produced by the optimizer with the OPTION (MERGE JOIN) hint. This is another case of the sum of the parts being less than the whole – summing 41 partial counts from 41 single-partition merge joins is faster than a single merge join and count over all partitions. Even so, this single-threaded collocated merge join is not as quick as the original parallel hash join plan, which executed in 2,600ms. On the positive side, our collocated merge join uses only one logical processor and requires no memory grant. The parallel hash join plan used 16 threads and reserved 569 MB of memory:   Using a Temporary Table Our collocated merge join plan should benefit from parallelism. The reason parallelism is not being used is that the query references a system table. We can work around that by writing the partition numbers to a temporary table (or table variable): SET STATISTICS IO ON; DECLARE @s datetime2 = SYSUTCDATETIME();   CREATE TABLE #P ( partition_number integer PRIMARY KEY);   INSERT #P (partition_number) SELECT p.partition_number FROM sys.partitions AS p WHERE p.[object_id] = OBJECT_ID(N'T1', N'U') AND p.index_id = 1;   SELECT row_count = SUM(Subtotals.cnt) FROM #P AS p CROSS APPLY ( SELECT cnt = COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID WHERE $PARTITION.PFT(T1.TID) = p.partition_number AND $PARTITION.PFT(T2.TID) = p.partition_number ) AS SubTotals;   DROP TABLE #P;   SELECT DATEDIFF(Millisecond, @s, SYSUTCDATETIME()); SET STATISTICS IO OFF; Using the temporary table adds a few logical reads, but the overall execution time is still around 3500ms, indistinguishable from the same query without the temporary table. The problem is that the query optimizer still doesn’t choose a parallel plan for this query, though the removal of the system table reference means that it could if it chose to: In fact the optimizer did enter the parallel plan phase of query optimization (running search 1 for a second time): Unfortunately, the parallel plan found seemed to be more expensive than the serial plan. This is a crazy result, caused by the optimizer’s cost model not reducing operator CPU costs on the inner side of a nested loops join. Don’t get me started on that, we’ll be here all night. In this plan, everything expensive happens on the inner side of a nested loops join. Without a CPU cost reduction to compensate for the added cost of exchange operators, candidate parallel plans always look more expensive to the optimizer than the equivalent serial plan. Parallel Collocated Merge Join We can produce the desired parallel plan using trace flag 8649 again: SELECT row_count = SUM(Subtotals.cnt) FROM #P AS p CROSS APPLY ( SELECT cnt = COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID WHERE $PARTITION.PFT(T1.TID) = p.partition_number AND $PARTITION.PFT(T2.TID) = p.partition_number ) AS SubTotals OPTION (QUERYTRACEON 8649); The actual execution plan is: One difference between this plan and the collocated hash join plan is that a Repartition Streams exchange operator is used instead of Distribute Streams. The effect is similar, though not quite identical. The Repartition uses round-robin partitioning, meaning the next partition id is pushed to the next thread in sequence. The Distribute Streams exchange seen earlier used Demand partitioning, meaning the next partition id is pulled across the exchange by the next thread that is ready for more work. There are subtle performance implications for each partitioning option, but going into that would again take us too far off the main point of this post. Performance The important thing is the performance of this parallel collocated merge join – just 1350ms on a typical run. The list below shows all the alternatives from this post (all timings include creation, population, and deletion of the temporary table where appropriate) from quickest to slowest: Collocated parallel merge join: 1350ms Parallel hash join: 2600ms Collocated serial merge join: 3500ms Serial merge join: 5000ms Parallel merge join: 8400ms Collated parallel hash join: 25,300ms (hash spill per partition) The parallel collocated merge join requires no memory grant (aside from a paltry 1.2MB used for exchange buffers). This plan uses 16 threads at DOP 8; but 8 of those are (rather pointlessly) allocated to the parallel scan of the temporary table. These are minor concerns, but it turns out there is a way to address them if it bothers you. Parallel Collocated Merge Join with Demand Partitioning This final tweak replaces the temporary table with a hard-coded list of partition ids (dynamic SQL could be used to generate this query from sys.partitions): SELECT row_count = SUM(Subtotals.cnt) FROM ( VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10), (11),(12),(13),(14),(15),(16),(17),(18),(19),(20), (21),(22),(23),(24),(25),(26),(27),(28),(29),(30), (31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41) ) AS P (partition_number) CROSS APPLY ( SELECT cnt = COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID WHERE $PARTITION.PFT(T1.TID) = p.partition_number AND $PARTITION.PFT(T2.TID) = p.partition_number ) AS SubTotals OPTION (QUERYTRACEON 8649); The actual execution plan is: The parallel collocated hash join plan is reproduced below for comparison: The manual rewrite has another advantage that has not been mentioned so far: the partial counts (per partition) can be computed earlier than the partial counts (per thread) in the optimizer’s collocated join plan. The earlier aggregation is performed by the extra Stream Aggregate under the nested loops join. The performance of the parallel collocated merge join is unchanged at around 1350ms. Final Words It is a shame that the current query optimizer does not consider a collocated merge join (Connect item closed as Won’t Fix). The example used in this post showed an improvement in execution time from 2600ms to 1350ms using a modestly-sized data set and limited parallelism. In addition, the memory requirement for the query was almost completely eliminated  – down from 569MB to 1.2MB. The problem with the parallel hash join selected by the optimizer is that it attempts to process the full data set all at once (albeit using eight threads). It requires a large memory grant to hold all 5 million rows from table T1 across the eight hash tables, and does not take advantage of the divide-and-conquer opportunity offered by the common partitioning. The great thing about the collocated join strategies is that each parallel thread works on a single partition from both tables, reading rows, performing the join, and computing a per-partition subtotal, before moving on to a new partition. From a thread’s point of view… If you have trouble visualizing what is happening from just looking at the parallel collocated merge join execution plan, let’s look at it again, but from the point of view of just one thread operating between the two Parallelism (exchange) operators. Our thread picks up a single partition id from the Distribute Streams exchange, and starts a merge join using ordered rows from partition 1 of table T1 and partition 1 of table T2. By definition, this is all happening on a single thread. As rows join, they are added to a (per-partition) count in the Stream Aggregate immediately above the Merge Join. Eventually, either T1 (partition 1) or T2 (partition 1) runs out of rows and the merge join stops. The per-partition count from the aggregate passes on through the Nested Loops join to another Stream Aggregate, which is maintaining a per-thread subtotal. Our same thread now picks up a new partition id from the exchange (say it gets id 9 this time). The count in the per-partition aggregate is reset to zero, and the processing of partition 9 of both tables proceeds just as it did for partition 1, and on the same thread. Each thread picks up a single partition id and processes all the data for that partition, completely independently from other threads working on other partitions. One thread might eventually process partitions (1, 9, 17, 25, 33, 41) while another is concurrently processing partitions (2, 10, 18, 26, 34) and so on for the other six threads at DOP 8. The point is that all 8 threads can execute independently and concurrently, continuing to process new partitions until the wider job (of which the thread has no knowledge!) is done. This divide-and-conquer technique can be much more efficient than simply splitting the entire workload across eight threads all at once. Related Reading Understanding and Using Parallelism in SQL Server Parallel Execution Plans Suck © 2013 Paul White – All Rights Reserved Twitter: @SQL_Kiwi

    Read the article

  • Oracle TimesTen In-Memory Database Performance on SPARC T4-2

    - by Brian
    The Oracle TimesTen In-Memory Database is optimized to run on Oracle's SPARC T4 processor platforms running Oracle Solaris 11 providing unsurpassed scalability, performance, upgradability, protection of investment and return on investment. The following demonstrate the value of combining Oracle TimesTen In-Memory Database with SPARC T4 servers and Oracle Solaris 11: On a Mobile Call Processing test, the 2-socket SPARC T4-2 server outperforms: Oracle's SPARC Enterprise M4000 server (4 x 2.66 GHz SPARC64 VII+) by 34%. Oracle's SPARC T3-4 (4 x 1.65 GHz SPARC T3) by 2.7x, or 5.4x per processor. Utilizing the TimesTen Performance Throughput Benchmark (TPTBM), the SPARC T4-2 server protects investments with: 2.1x the overall performance of a 4-socket SPARC Enterprise M4000 server in read-only mode and 1.5x the performance in update-only testing. This is 4.2x more performance per processor than the SPARC64 VII+ 2.66 GHz based system. 10x more performance per processor than the SPARC T2+ 1.4 GHz server. 1.6x better performance per processor than the SPARC T3 1.65 GHz based server. In replication testing, the two socket SPARC T4-2 server is over 3x faster than the performance of a four socket SPARC Enterprise T5440 server in both asynchronous replication environment and the highly available 2-Safe replication. This testing emphasizes parallel replication between systems. Performance Landscape Mobile Call Processing Test Performance System Processor Sockets/Cores/Threads Tps SPARC T4-2 SPARC T4, 2.85 GHz 2 16 128 218,400 M4000 SPARC64 VII+, 2.66 GHz 4 16 32 162,900 SPARC T3-4 SPARC T3, 1.65 GHz 4 64 512 80,400 TimesTen Performance Throughput Benchmark (TPTBM) Read-Only System Processor Sockets/Cores/Threads Tps SPARC T3-4 SPARC T3, 1.65 GHz 4 64 512 7.9M SPARC T4-2 SPARC T4, 2.85 GHz 2 16 128 6.5M M4000 SPARC64 VII+, 2.66 GHz 4 16 32 3.1M T5440 SPARC T2+, 1.4 GHz 4 32 256 3.1M TimesTen Performance Throughput Benchmark (TPTBM) Update-Only System Processor Sockets/Cores/Threads Tps SPARC T4-2 SPARC T4, 2.85 GHz 2 16 128 547,800 M4000 SPARC64 VII+, 2.66 GHz 4 16 32 363,800 SPARC T3-4 SPARC T3, 1.65 GHz 4 64 512 240,500 TimesTen Replication Tests System Processor Sockets/Cores/Threads Asynchronous 2-Safe SPARC T4-2 SPARC T4, 2.85 GHz 2 16 128 38,024 13,701 SPARC T5440 SPARC T2+, 1.4 GHz 4 32 256 11,621 4,615 Configuration Summary Hardware Configurations: SPARC T4-2 server 2 x SPARC T4 processors, 2.85 GHz 256 GB memory 1 x 8 Gbs FC Qlogic HBA 1 x 6 Gbs SAS HBA 4 x 300 GB internal disks Sun Storage F5100 Flash Array (40 x 24 GB flash modules) 1 x Sun Fire X4275 server configured as COMSTAR head SPARC T3-4 server 4 x SPARC T3 processors, 1.6 GHz 512 GB memory 1 x 8 Gbs FC Qlogic HBA 8 x 146 GB internal disks 1 x Sun Fire X4275 server configured as COMSTAR head SPARC Enterprise M4000 server 4 x SPARC64 VII+ processors, 2.66 GHz 128 GB memory 1 x 8 Gbs FC Qlogic HBA 1 x 6 Gbs SAS HBA 2 x 146 GB internal disks Sun Storage F5100 Flash Array (40 x 24 GB flash modules) 1 x Sun Fire X4275 server configured as COMSTAR head Software Configuration: Oracle Solaris 11 11/11 Oracle TimesTen 11.2.2.4 Benchmark Descriptions TimesTen Performance Throughput BenchMark (TPTBM) is shipped with TimesTen and measures the total throughput of the system. The workload can test read-only, update-only, delete and insert operations as required. Mobile Call Processing is a customer-based workload for processing calls made by mobile phone subscribers. The workload has a mixture of read-only, update, and insert-only transactions. The peak throughput performance is measured from multiple concurrent processes executing the transactions until a peak performance is reached via saturation of the available resources. Parallel Replication tests using both asynchronous and 2-Safe replication methods. For asynchronous replication, transactions are processed in batches to maximize the throughput capabilities of the replication server and network. In 2-Safe replication, also known as no data-loss or high availability, transactions are replicated between servers immediately emphasizing low latency. For both environments, performance is measured in the number of parallel replication servers and the maximum transactions-per-second for all concurrent processes. See Also SPARC T4-2 Server oracle.com OTN Oracle TimesTen In-Memory Database oracle.com OTN Oracle Solaris oracle.com OTN Oracle Database 11g Release 2 Enterprise Edition oracle.com OTN Disclosure Statement Copyright 2012, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 1 October 2012.

    Read the article

  • Strange performance issue with Dell R7610 and LSI 2208 RAID controller

    - by GregC
    Connecting controller to any of the three PCIe x16 slots yield choppy read performance around 750 MB/sec Lowly PCIe x4 slot yields steady 1.2 GB/sec read Given same files, same Windows Server 2008 R2 OS, same RAID6 24-disk Seagate ES.2 3TB array on LSI 9286-8e, same Dell R7610 Precision Workstation with A03 BIOS, same W5000 graphics card (no other cards), same settings etc. I see super-low CPU utilization in both cases. SiSoft Sandra reports x8 at 5GT/sec in x16 slot, and x4 at 5GT/sec in x4 slot, as expected. I'd like to be able to rely on the sheer speed of x16 slots. What gives? What can I try? Any ideas? Please assist Cross-posted from http://en.community.dell.com/support-forums/desktop/f/3514/t/19526990.aspx Follow-up information We did some more performance testing with reading from 8 SSDs, connected directly (without an expander chip). This means that both SAS cables were utilized. We saw nearly double performance, but it varied from run to run: {2.0, 1.8, 1.6, and 1.4 GB/sec were observed, then performance jumped back up to 2.0}. The SSD RAID0 tests were conducted in a x16 PCIe slot, all other variables kept the same. It seems to me that we were getting double the performance of HDD-based RAID6 array. Just for reference: maximum possible read burst speed over single channel of SAS 6Gb/sec is 570 MB/sec due to 8b/10b encoding and protocol limitations (SAS cable provides four such channels).

    Read the article

  • Looking for application performance tracking software

    - by JavaRocky
    I have multiple java-based applications which produce statistics on how long method calls take. Right now the information is being written into a log file and I analyse performance that way. However with multiple apps and more monitoring requirements this is being becoming a bit overwhelming. I am looking for an application which will collect stats and graph them so I can analyse performance and be aware of performance degradation. I have looked at Solarwinds Application Performance Monitoring, however this polls periodically to gather information. My applications are totally event based and we would like to graph and track this accordingly. I almost started hacking together some scripts to produce Google Charts but surely there are applications which do this already. Suggestions?

    Read the article

  • Hyper-V performance comparisons vs physical client?

    - by rwmnau
    Are there any comparisons between Hyper-V client machines and their physical equivalent? I've looked around and can find 4000 articles about improving Hyper-V performance, but I can't find any that actually do a side-by-side comparison or give benchmarking numbers. Ideally, I'm interested in a comparison of CPU, memory, disk, and graphics performance between something like the following: Some powerful workstation (with plenty of RAM) with Windows 7 installed on it directly Same exact worksation with Hyper-V Server 2008 R2 (the bare Server role) and a full-screen Windows 7 client machine Virtual Server 2005 had performance that didn't compare at all with actual hardware, but with the advances in CPU and hardware-level virtualization, has performance improved significantly? How obvious would it be to a user of the two above scenarios that one of them was virtualized, and does anybody know of actual benchmarking of this type?

    Read the article

  • LTO 2 tape performance in LTO 3 drive

    - by hmallett
    I have a pile of LTO 2 tapes, and both an LTO 2 drive (HP Ultrium 460e), and an autoloader with an LTO 3 drive in (Tandberg T24 autoloader, with a HP drive). Performance of the LTO 2 tapes in the LTO 2 drive is adequate and consistent. HP L&TT tells me that the tapes can be read and written at 64 MB/s, which seems in line with the performance specifications of the drive. When I perform a backup (over the network) using Symantec Backup Exec, I get about 1700 MB/min backup and verify speeds, which is slower, but still adequate. Performance of the LTO 2 tapes in the LTO 3 drive in the autoloader is a different story. HP L&TT tells me that the tapes can be read at 82 MB/s and written at 49 MB/s, which seems unusual at the write speed drop, but not the end of the world. When I perform a backup (over the network) using Symantec Backup Exec though, I get about 331 MB/min backup speed and 205 MB/min verify speeds, which is not only much slower, but also much slower for reads than for writes. Notes: The comparison testing was done on the same server, SCSI card and SCSI cable, with the same backup data set and the same tape each time. The tape and drives are error-free (according to HP L&TT and Backup Exec). The SCSI card is a U160 card, which is not normally recommended for LTO 3, but we're not writing to LTO 3 tapes at LTO 3 speeds, and a U320 SCSI card is not available to me at the moment. As I'm scratching my head to determine the reason for the performance drop, my first question is: While LTO drives can write to the previous generation LTO tapes, does doing so normally incur a performance penalty?

    Read the article

  • Raid-5 Performance per spindle scaling

    - by Bill N.
    So I am stuck in a corner, I have a storage project that is limited to 24 spindles, and requires heavy random Write (the corresponding read side is purely sequential). Needs every bit of space on my Drives, ~13TB total in a n-1 raid-5, and has to go fast, over 2GB/s sort of fast. The obvious answer is to use a Stripe/Concat (Raid-0/1), or better yet a raid-10 in place of the raid-5, but that is disallowed for reasons beyond my control. So I am here asking for help in getting a sub optimal configuration to be as good as it can be. The array built on direct attached SAS-2 10K rpm drives, backed by a ARECA 18xx series controller with 4GB of cache. 64k array stripes and an 4K stripe aligned XFS File system, with 24 Allocation groups (to avoid some of the penalty for being raid 5). The heart of my question is this: In the same setup with 6 spindles/AG's I see a near disk limited performance on the write, ~100MB/s per spindle, at 12 spindles I see that drop to ~80MB/s and at 24 ~60MB/s. I would expect that with a distributed parity and matched AG's, the performance should scale with the # of spindles, or be worse at small spindle counts, but this array is doing the opposite. What am I missing ? Should Raid-5 performance scale with # of spindles ? Many thanks for your answers and any ideas, input, or guidance. --Bill Edit: Improving RAID performance The other relevant thread I was able to find, discusses some of the same issues in the answers, though it still leaves me with out an answer on the performance scaling.

    Read the article

  • How to limit disk performance?

    - by DrakeES
    I am load-testing a web application and studying the impact of some config tweaks (related to disk i/o) on the overall app performance, i.e. the amount of users that can be handled simultaneously. But the problem is that I hit 100% CPU before I can see any effect of the disk-related config settings. I am therefore wondering if there is a way I could deliberately limit the disk performance so that it becomes the bottleneck and the tweaks I am trying to play with actually start impacting performance. Should I just make the hard disk busy with something else? What would serve the best for this purpose? More details (probably irrelevant, but anyway): PHP/Magento/Apache, studying the impact of apc.stat. Setting it to 0 makes APC not checking PHP scripts for modification which should increase performance where disk is the bottleneck. Using JMeter for benchmarking.

    Read the article

  • Good C++ books regarding Performance?

    - by Leon
    Besides the books everyone knows about, like Meyer's 3 Effective C++/STL books, are there any other really good C++ books specifically aimed towards performance code? Maybe this is for gaming, telecommunications, finance/high frequency etc? When I say performance I mean things where a normal C++ book wouldnt bother advising because the gain in performance isn't worthwhile for 95% of C++ developers. Maybe suggestions like avoiding virtual pointers, going into great depth about inlining etc? A book going into great depth on C++ memory allocation or multithreading performance would obviously be very useful.

    Read the article

  • In MySQL, what is the most effective query design for joining large tables with many to many relatio

    - by lighthouse65
    In our application, we collect data on automotive engine performance -- basically source data on engine performance based on the engine type, the vehicle running it and the engine design. Currently, the basis for new row inserts is an engine on-off period; we monitor performance variables based on a change in engine state from active to inactive and vice versa. The related engineState table looks like this: +---------+-----------+---------------+---------------------+---------------------+-----------------+ | vehicle | engine | engine_state | state_start_time | state_end_time | engine_variable | +---------+-----------+---------------+---------------------+---------------------+-----------------+ | 080025 | E01 | active | 2008-01-24 16:19:15 | 2008-01-24 16:24:45 | 720 | | 080028 | E02 | inactive | 2008-01-24 16:19:25 | 2008-01-24 16:22:17 | 304 | +---------+-----------+---------------+---------------------+---------------------+-----------------+ For a specific analysis, we would like to analyze table content based on a row granularity of minutes, rather than the current basis of active / inactive engine state. For this, we are thinking of creating a simple productionMinute table with a row for each minute in the period we are analyzing and joining the productionMinute and engineEvent tables on the date-time columns in each table. So if our period of analysis is from 2009-12-01 to 2010-02-28, we would create a new table with 129,600 rows, one for each minute of each day for that three-month period. The first few rows of the productionMinute table: +---------------------+ | production_minute | +---------------------+ | 2009-12-01 00:00 | | 2009-12-01 00:01 | | 2009-12-01 00:02 | | 2009-12-01 00:03 | +---------------------+ The join between the tables would be engineState AS es LEFT JOIN productionMinute AS pm ON es.state_start_time <= pm.production_minute AND pm.production_minute <= es.event_end_time. This join, however, brings up multiple environmental issues: The engineState table has 5 million rows and the productionMinute table has 130,000 rows When an engineState row spans more than one minute (i.e. the difference between es.state_start_time and es.state_end_time is greater than one minute), as is the case in the example above, there are multiple productionMinute table rows that join to a single engineState table row When there is more than one engine in operation during any given minute, also as per the example above, multiple engineState table rows join to a single productionMinute row In testing our logic and using only a small table extract (one day rather than 3 months, for the productionMinute table) the query takes over an hour to generate. In researching this item in order to improve performance so that it would be feasible to query three months of data, our thoughts were to create a temporary table from the engineEvent one, eliminating any table data that is not critical for the analysis, and joining the temporary table to the productionMinute table. We are also planning on experimenting with different joins -- specifically an inner join -- to see if that would improve performance. What is the best query design for joining tables with the many:many relationship between the join predicates as outlined above? What is the best join type (left / right, inner)?

    Read the article

  • JDBC programms running long time performance issue

    - by phyerbarte
    My program has an issue with Oracle query performance, I believe the SQL have good performance, because it returns quickly in SQLPlus. But when my program has been running for a long time, like 1 week, the SQL query (using JDBC) becomes slower (In my logs, the query time is much longer than when I originally started the program). When I restart my program, the query performance comes back to normal. I think it is could be something wrong with the way I use the preparedStatement, because the SQL I'm using does not use placeholders "?" at all. Just a complex select query. The query process is done by a util class. Here is the pertinent code building the query: public List<String[]> query(String sql, String[] args) { Connection conn = null; conn = openConnection(); conn.setAutocommit(true); .... PreparedStatement preStatm = null; ResultSet rs = null; ....//set preparedstatment arg code rs = preStatm.executeQuery(); .... finally{ //close rs //close prestatm //close connection } } In my case, the args is always null, so it just passes a query sql to this query method. Is that possible this way could slow down the DB query after program long time running? Or I should use statement instead, or just pass args with "?" in the SQL? How can I find out the root cause for my issue? Thanks.

    Read the article

  • How to cache queries in EJB and return result efficient (performance POV)

    - by Maxym
    I use JBoss EJB 3.0 implementation (JBoss 4.2.3 server) At the beginning I created native query all the time using construction like Query query = entityManager.createNativeQuery("select * from _table_"); Of couse it is not that efficient, I performed some tests and found out that it really takes a lot of time... Then I found a better way to deal with it, to use annotation to define native queries: @NamedNativeQuery( name = "fetchData", value = "select * from _table_", resultClass=Entity.class ) and then just use it Query query = entityManager.createNamedQuery("fetchData"); the performance of code line above is two times better than where I started from, but still not that good as I expected... then I found that I can switch to Hibernate annotation for NamedNativeQuery (anyway, JBoss's implementation of EJB is based on Hibernate), and add one more thing: @NamedNativeQuery( name = "fetchData2", value = "select * from _table_", resultClass=Entity.class, readOnly=true) readOnly - marks whether the results are fetched in read-only mode or not. It sounds good, because at least in this case of mine I don't need to update data, I wanna just fetch it for report. When I started server to measure performance I noticed that query without readOnly=true (by default it is false) returns result with each iteration better and better, and at the same time another one (fetchData2) works like "stable" and with time difference between them is shorter and shorter, and after 5 iterations speed of both was almost the same... The questions are: 1) is there any other way to speed query using up? Seems that named queries should be prepared once, but I can't say it... In fact if to create query once and then just use it it would be better from performance point of view, but it is problematic to cache this object, because after creating query I can set parameters (when I use ":variable" in query), and it changes query object (isn't it?). well, is here any way to cache them? Or named query is the best option I can use? 2) any other approaches how to make results retrieveng faster. I mean, for instance I don't need those Entities to be attached, I won't update them, all I need is just fetch collection of data. Maybe readOnly is the only available way, so I can't speed it up, but who knows :) P.S. I don't ask about DB performance, all I need now is how not to create query all the time, so use it efficient, and to "allow" EJB to do less job with the same result concerning data returning.

    Read the article

  • puzzled with java if else performance

    - by user1906966
    I am doing an investigation on a method's performance and finally identified the overhead was caused by the "else" portion of the if else statement. I have written a small program to illustrate the performance difference even when the else portion of the code never gets executed: public class TestIfPerf { public static void main( String[] args ) { boolean condition = true; long time = 0L; int value = 0; // warm up test for( int count=0; count<10000000; count++ ) { if ( condition ) { value = 1 + 2; } else { value = 1 + 3; } } // benchmark if condition only time = System.nanoTime(); for( int count=0; count<10000000; count++ ) { if ( condition ) { value = 1 + 2; } } time = System.nanoTime() - time; System.out.println( "1) performance " + time ); time = System.nanoTime(); // benchmark if else condition for( int count=0; count<10000000; count++ ) { if ( condition ) { value = 1 + 2; } else { value = 1 + 3; } } time = System.nanoTime() - time; System.out.println( "2) performance " + time ); } } and run the test program with java -classpath . -Dmx=800m -Dms=800m TestIfPerf. I performed this on both Mac and Linux Java with 1.6 latest build. Consistently the first benchmark, without the else is much faster than the second benchmark with the else section even though the code is structured such that the else portion is never executed because of the condition. I understand that to some, the difference might not be significant but the relative performance difference is large. I wonder if anyone has any insight to this (or maybe there is something I did incorrectly). Linux benchmark (in nano) performance 1215488 performance 2629531 Mac benchmark (in nano) performance 1667000 performance 4208000

    Read the article

  • Function Folding in #PowerQuery

    - by Darren Gosbell
    Originally posted on: http://geekswithblogs.net/darrengosbell/archive/2014/05/16/function-folding-in-powerquery.aspxLooking at a typical Power Query query you will noticed that it's made up of a number of small steps. As an example take a look at the query I did in my previous post about joining a fact table to a slowly changing dimension. It was roughly built up of the following steps: Get all records from the fact table Get all records from the dimension table do an outer join between these two tables on the business key (resulting in an increase in the row count as there are multiple records in the dimension table for each business key) Filter out the excess rows introduced in step 3 remove extra columns that are not required in the final result set. If Power Query was to execute a query like this literally, following the same steps in the same order it would not be overly efficient. Particularly if your two source tables were quite large. However Power Query has a feature called function folding where it can take a number of these small steps and push them down to the data source. The degree of function folding that can be performed depends on the data source, As you might expect, relational data sources like SQL Server, Oracle and Teradata support folding, but so do some of the other sources like OData, Exchange and Active Directory. To explore how this works I took the data from my previous post and loaded it into a SQL database. Then I converted my Power Query expression to source it's data from that database. Below is the resulting Power Query which I edited by hand so that the whole thing can be shown in a single expression: let     SqlSource = Sql.Database("localhost", "PowerQueryTest"),     BU = SqlSource{[Schema="dbo",Item="BU"]}[Data],     Fact = SqlSource{[Schema="dbo",Item="fact"]}[Data],     Source = Table.NestedJoin(Fact,{"BU_Code"},BU,{"BU_Code"},"NewColumn"),     LeftJoin = Table.ExpandTableColumn(Source, "NewColumn"                                   , {"BU_Key", "StartDate", "EndDate"}                                   , {"BU_Key", "StartDate", "EndDate"}),     BetweenFilter = Table.SelectRows(LeftJoin, each (([Date] >= [StartDate]) and ([Date] <= [EndDate])) ),     RemovedColumns = Table.RemoveColumns(BetweenFilter,{"StartDate", "EndDate"}) in     RemovedColumns If the above query was run step by step in a literal fashion you would expect it to run two queries against the SQL database doing "SELECT * …" from both tables. However a profiler trace shows just the following single SQL query: select [_].[BU_Code],     [_].[Date],     [_].[Amount],     [_].[BU_Key] from (     select [$Outer].[BU_Code],         [$Outer].[Date],         [$Outer].[Amount],         [$Inner].[BU_Key],         [$Inner].[StartDate],         [$Inner].[EndDate]     from [dbo].[fact] as [$Outer]     left outer join     (         select [_].[BU_Key] as [BU_Key],             [_].[BU_Code] as [BU_Code2],             [_].[BU_Name] as [BU_Name],             [_].[StartDate] as [StartDate],             [_].[EndDate] as [EndDate]         from [dbo].[BU] as [_]     ) as [$Inner] on ([$Outer].[BU_Code] = [$Inner].[BU_Code2] or [$Outer].[BU_Code] is null and [$Inner].[BU_Code2] is null) ) as [_] where [_].[Date] >= [_].[StartDate] and [_].[Date] <= [_].[EndDate] The resulting query is a little strange, you can probably tell that it was generated programmatically. But if you look closely you'll notice that every single part of the Power Query formula has been pushed down to SQL Server. Power Query itself ends up just constructing the query and passing the results back to Excel, it does not do any of the data transformation steps itself. So now you can feel a bit more comfortable showing Power Query to your less technical Colleagues knowing that the tool will do it's best fold all the  small steps in Power Query down the most efficient query that it can against the source systems.

    Read the article

  • Performance Testing Versus Unit Testing

    - by Mystagogue
    I'm reading Osherove's "The Art of Unit Testing," and though I've not yet seen him say anything about performance testing, two thoughts still cross my mind: Performance tests generally can't be unit tests, because performance tests generally need to run for long periods of time. Performance tests generally can't be unit tests, because performance issues too often manifest at an integration or system level (or at least the logic of a single unit test needed to re-create the performance of the integration environment would be too involved to be a unit test). Particularly for the first reason stated above, I doubt it makes sense for performance tests to be handled by a unit testing framework (such as NUnit). My question is: do my findings / leanings correspond with the thoughts of the community?

    Read the article

< Previous Page | 11 12 13 14 15 16 17 18 19 20 21 22  | Next Page >