Search Results

Search found 8547 results on 342 pages for 'hash join'.

Page 1/342 | 1 2 3 4 5 6 7 8 9 10 11 12  | Next Page >

  • Improving Partitioned Table Join Performance

    - by Paul White
    The query optimizer does not always choose an optimal strategy when joining partitioned tables. This post looks at an example, showing how a manual rewrite of the query can almost double performance, while reducing the memory grant to almost nothing. Test Data The two tables in this example use a common partitioning partition scheme. The partition function uses 41 equal-size partitions: CREATE PARTITION FUNCTION PFT (integer) AS RANGE RIGHT FOR VALUES ( 125000, 250000, 375000, 500000, 625000, 750000, 875000, 1000000, 1125000, 1250000, 1375000, 1500000, 1625000, 1750000, 1875000, 2000000, 2125000, 2250000, 2375000, 2500000, 2625000, 2750000, 2875000, 3000000, 3125000, 3250000, 3375000, 3500000, 3625000, 3750000, 3875000, 4000000, 4125000, 4250000, 4375000, 4500000, 4625000, 4750000, 4875000, 5000000 ); GO CREATE PARTITION SCHEME PST AS PARTITION PFT ALL TO ([PRIMARY]); There two tables are: CREATE TABLE dbo.T1 ( TID integer NOT NULL IDENTITY(0,1), Column1 integer NOT NULL, Padding binary(100) NOT NULL DEFAULT 0x,   CONSTRAINT PK_T1 PRIMARY KEY CLUSTERED (TID) ON PST (TID) );   CREATE TABLE dbo.T2 ( TID integer NOT NULL, Column1 integer NOT NULL, Padding binary(100) NOT NULL DEFAULT 0x,   CONSTRAINT PK_T2 PRIMARY KEY CLUSTERED (TID, Column1) ON PST (TID) ); The next script loads 5 million rows into T1 with a pseudo-random value between 1 and 5 for Column1. The table is partitioned on the IDENTITY column TID: INSERT dbo.T1 WITH (TABLOCKX) (Column1) SELECT (ABS(CHECKSUM(NEWID())) % 5) + 1 FROM dbo.Numbers AS N WHERE n BETWEEN 1 AND 5000000; In case you don’t already have an auxiliary table of numbers lying around, here’s a script to create one with 10 million rows: CREATE TABLE dbo.Numbers (n bigint PRIMARY KEY);   WITH L0 AS(SELECT 1 AS c UNION ALL SELECT 1), L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B), L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B), L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B), L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B), L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B), Nums AS(SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM L5) INSERT dbo.Numbers WITH (TABLOCKX) SELECT TOP (10000000) n FROM Nums ORDER BY n OPTION (MAXDOP 1); Table T1 contains data like this: Next we load data into table T2. The relationship between the two tables is that table 2 contains ‘n’ rows for each row in table 1, where ‘n’ is determined by the value in Column1 of table T1. There is nothing particularly special about the data or distribution, by the way. INSERT dbo.T2 WITH (TABLOCKX) (TID, Column1) SELECT T.TID, N.n FROM dbo.T1 AS T JOIN dbo.Numbers AS N ON N.n >= 1 AND N.n <= T.Column1; Table T2 ends up containing about 15 million rows: The primary key for table T2 is a combination of TID and Column1. The data is partitioned according to the value in column TID alone. Partition Distribution The following query shows the number of rows in each partition of table T1: SELECT PartitionID = CA1.P, NumRows = COUNT_BIG(*) FROM dbo.T1 AS T CROSS APPLY (VALUES ($PARTITION.PFT(TID))) AS CA1 (P) GROUP BY CA1.P ORDER BY CA1.P; There are 40 partitions containing 125,000 rows (40 * 125k = 5m rows). The rightmost partition remains empty. The next query shows the distribution for table 2: SELECT PartitionID = CA1.P, NumRows = COUNT_BIG(*) FROM dbo.T2 AS T CROSS APPLY (VALUES ($PARTITION.PFT(TID))) AS CA1 (P) GROUP BY CA1.P ORDER BY CA1.P; There are roughly 375,000 rows in each partition (the rightmost partition is also empty): Ok, that’s the test data done. Test Query and Execution Plan The task is to count the rows resulting from joining tables 1 and 2 on the TID column: SET STATISTICS IO ON; DECLARE @s datetime2 = SYSUTCDATETIME();   SELECT COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID;   SELECT DATEDIFF(Millisecond, @s, SYSUTCDATETIME()); SET STATISTICS IO OFF; The optimizer chooses a plan using parallel hash join, and partial aggregation: The Plan Explorer plan tree view shows accurate cardinality estimates and an even distribution of rows across threads (click to enlarge the image): With a warm data cache, the STATISTICS IO output shows that no physical I/O was needed, and all 41 partitions were touched: Running the query without actual execution plan or STATISTICS IO information for maximum performance, the query returns in around 2600ms. Execution Plan Analysis The first step toward improving on the execution plan produced by the query optimizer is to understand how it works, at least in outline. The two parallel Clustered Index Scans use multiple threads to read rows from tables T1 and T2. Parallel scan uses a demand-based scheme where threads are given page(s) to scan from the table as needed. This arrangement has certain important advantages, but does result in an unpredictable distribution of rows amongst threads. The point is that multiple threads cooperate to scan the whole table, but it is impossible to predict which rows end up on which threads. For correct results from the parallel hash join, the execution plan has to ensure that rows from T1 and T2 that might join are processed on the same thread. For example, if a row from T1 with join key value ‘1234’ is placed in thread 5’s hash table, the execution plan must guarantee that any rows from T2 that also have join key value ‘1234’ probe thread 5’s hash table for matches. The way this guarantee is enforced in this parallel hash join plan is by repartitioning rows to threads after each parallel scan. The two repartitioning exchanges route rows to threads using a hash function over the hash join keys. The two repartitioning exchanges use the same hash function so rows from T1 and T2 with the same join key must end up on the same hash join thread. Expensive Exchanges This business of repartitioning rows between threads can be very expensive, especially if a large number of rows is involved. The execution plan selected by the optimizer moves 5 million rows through one repartitioning exchange and around 15 million across the other. As a first step toward removing these exchanges, consider the execution plan selected by the optimizer if we join just one partition from each table, disallowing parallelism: SELECT COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID WHERE $PARTITION.PFT(T1.TID) = 1 AND $PARTITION.PFT(T2.TID) = 1 OPTION (MAXDOP 1); The optimizer has chosen a (one-to-many) merge join instead of a hash join. The single-partition query completes in around 100ms. If everything scaled linearly, we would expect that extending this strategy to all 40 populated partitions would result in an execution time around 4000ms. Using parallelism could reduce that further, perhaps to be competitive with the parallel hash join chosen by the optimizer. This raises a question. If the most efficient way to join one partition from each of the tables is to use a merge join, why does the optimizer not choose a merge join for the full query? Forcing a Merge Join Let’s force the optimizer to use a merge join on the test query using a hint: SELECT COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID OPTION (MERGE JOIN); This is the execution plan selected by the optimizer: This plan results in the same number of logical reads reported previously, but instead of 2600ms the query takes 5000ms. The natural explanation for this drop in performance is that the merge join plan is only using a single thread, whereas the parallel hash join plan could use multiple threads. Parallel Merge Join We can get a parallel merge join plan using the same query hint as before, and adding trace flag 8649: SELECT COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID OPTION (MERGE JOIN, QUERYTRACEON 8649); The execution plan is: This looks promising. It uses a similar strategy to distribute work across threads as seen for the parallel hash join. In practice though, performance is disappointing. On a typical run, the parallel merge plan runs for around 8400ms; slower than the single-threaded merge join plan (5000ms) and much worse than the 2600ms for the parallel hash join. We seem to be going backwards! The logical reads for the parallel merge are still exactly the same as before, with no physical IOs. The cardinality estimates and thread distribution are also still very good (click to enlarge): A big clue to the reason for the poor performance is shown in the wait statistics (captured by Plan Explorer Pro): CXPACKET waits require careful interpretation, and are most often benign, but in this case excessive waiting occurs at the repartitioning exchanges. Unlike the parallel hash join, the repartitioning exchanges in this plan are order-preserving ‘merging’ exchanges (because merge join requires ordered inputs): Parallelism works best when threads can just grab any available unit of work and get on with processing it. Preserving order introduces inter-thread dependencies that can easily lead to significant waits occurring. In extreme cases, these dependencies can result in an intra-query deadlock, though the details of that will have to wait for another time to explore in detail. The potential for waits and deadlocks leads the query optimizer to cost parallel merge join relatively highly, especially as the degree of parallelism (DOP) increases. This high costing resulted in the optimizer choosing a serial merge join rather than parallel in this case. The test results certainly confirm its reasoning. Collocated Joins In SQL Server 2008 and later, the optimizer has another available strategy when joining tables that share a common partition scheme. This strategy is a collocated join, also known as as a per-partition join. It can be applied in both serial and parallel execution plans, though it is limited to 2-way joins in the current optimizer. Whether the optimizer chooses a collocated join or not depends on cost estimation. The primary benefits of a collocated join are that it eliminates an exchange and requires less memory, as we will see next. Costing and Plan Selection The query optimizer did consider a collocated join for our original query, but it was rejected on cost grounds. The parallel hash join with repartitioning exchanges appeared to be a cheaper option. There is no query hint to force a collocated join, so we have to mess with the costing framework to produce one for our test query. Pretending that IOs cost 50 times more than usual is enough to convince the optimizer to use collocated join with our test query: -- Pretend IOs are 50x cost temporarily DBCC SETIOWEIGHT(50);   -- Co-located hash join SELECT COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID OPTION (RECOMPILE);   -- Reset IO costing DBCC SETIOWEIGHT(1); Collocated Join Plan The estimated execution plan for the collocated join is: The Constant Scan contains one row for each partition of the shared partitioning scheme, from 1 to 41. The hash repartitioning exchanges seen previously are replaced by a single Distribute Streams exchange using Demand partitioning. Demand partitioning means that the next partition id is given to the next parallel thread that asks for one. My test machine has eight logical processors, and all are available for SQL Server to use. As a result, there are eight threads in the single parallel branch in this plan, each processing one partition from each table at a time. Once a thread finishes processing a partition, it grabs a new partition number from the Distribute Streams exchange…and so on until all partitions have been processed. It is important to understand that the parallel scans in this plan are different from the parallel hash join plan. Although the scans have the same parallelism icon, tables T1 and T2 are not being co-operatively scanned by multiple threads in the same way. Each thread reads a single partition of T1 and performs a hash match join with the same partition from table T2. The properties of the two Clustered Index Scans show a Seek Predicate (unusual for a scan!) limiting the rows to a single partition: The crucial point is that the join between T1 and T2 is on TID, and TID is the partitioning column for both tables. A thread that processes partition ‘n’ is guaranteed to see all rows that can possibly join on TID for that partition. In addition, no other thread will see rows from that partition, so this removes the need for repartitioning exchanges. CPU and Memory Efficiency Improvements The collocated join has removed two expensive repartitioning exchanges and added a single exchange processing 41 rows (one for each partition id). Remember, the parallel hash join plan exchanges had to process 5 million and 15 million rows. The amount of processor time spent on exchanges will be much lower in the collocated join plan. In addition, the collocated join plan has a maximum of 8 threads processing single partitions at any one time. The 41 partitions will all be processed eventually, but a new partition is not started until a thread asks for it. Threads can reuse hash table memory for the new partition. The parallel hash join plan also had 8 hash tables, but with all 5,000,000 build rows loaded at the same time. The collocated plan needs memory for only 8 * 125,000 = 1,000,000 rows at any one time. Collocated Hash Join Performance The collated join plan has disappointing performance in this case. The query runs for around 25,300ms despite the same IO statistics as usual. This is much the worst result so far, so what went wrong? It turns out that cardinality estimation for the single partition scans of table T1 is slightly low. The properties of the Clustered Index Scan of T1 (graphic immediately above) show the estimation was for 121,951 rows. This is a small shortfall compared with the 125,000 rows actually encountered, but it was enough to cause the hash join to spill to physical tempdb: A level 1 spill doesn’t sound too bad, until you realize that the spill to tempdb probably occurs for each of the 41 partitions. As a side note, the cardinality estimation error is a little surprising because the system tables accurately show there are 125,000 rows in every partition of T1. Unfortunately, the optimizer uses regular column and index statistics to derive cardinality estimates here rather than system table information (e.g. sys.partitions). Collocated Merge Join We will never know how well the collocated parallel hash join plan might have worked without the cardinality estimation error (and the resulting 41 spills to tempdb) but we do know: Merge join does not require a memory grant; and Merge join was the optimizer’s preferred join option for a single partition join Putting this all together, what we would really like to see is the same collocated join strategy, but using merge join instead of hash join. Unfortunately, the current query optimizer cannot produce a collocated merge join; it only knows how to do collocated hash join. So where does this leave us? CROSS APPLY sys.partitions We can try to write our own collocated join query. We can use sys.partitions to find the partition numbers, and CROSS APPLY to get a count per partition, with a final step to sum the partial counts. The following query implements this idea: SELECT row_count = SUM(Subtotals.cnt) FROM ( -- Partition numbers SELECT p.partition_number FROM sys.partitions AS p WHERE p.[object_id] = OBJECT_ID(N'T1', N'U') AND p.index_id = 1 ) AS P CROSS APPLY ( -- Count per collocated join SELECT cnt = COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID WHERE $PARTITION.PFT(T1.TID) = p.partition_number AND $PARTITION.PFT(T2.TID) = p.partition_number ) AS SubTotals; The estimated plan is: The cardinality estimates aren’t all that good here, especially the estimate for the scan of the system table underlying the sys.partitions view. Nevertheless, the plan shape is heading toward where we would like to be. Each partition number from the system table results in a per-partition scan of T1 and T2, a one-to-many Merge Join, and a Stream Aggregate to compute the partial counts. The final Stream Aggregate just sums the partial counts. Execution time for this query is around 3,500ms, with the same IO statistics as always. This compares favourably with 5,000ms for the serial plan produced by the optimizer with the OPTION (MERGE JOIN) hint. This is another case of the sum of the parts being less than the whole – summing 41 partial counts from 41 single-partition merge joins is faster than a single merge join and count over all partitions. Even so, this single-threaded collocated merge join is not as quick as the original parallel hash join plan, which executed in 2,600ms. On the positive side, our collocated merge join uses only one logical processor and requires no memory grant. The parallel hash join plan used 16 threads and reserved 569 MB of memory:   Using a Temporary Table Our collocated merge join plan should benefit from parallelism. The reason parallelism is not being used is that the query references a system table. We can work around that by writing the partition numbers to a temporary table (or table variable): SET STATISTICS IO ON; DECLARE @s datetime2 = SYSUTCDATETIME();   CREATE TABLE #P ( partition_number integer PRIMARY KEY);   INSERT #P (partition_number) SELECT p.partition_number FROM sys.partitions AS p WHERE p.[object_id] = OBJECT_ID(N'T1', N'U') AND p.index_id = 1;   SELECT row_count = SUM(Subtotals.cnt) FROM #P AS p CROSS APPLY ( SELECT cnt = COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID WHERE $PARTITION.PFT(T1.TID) = p.partition_number AND $PARTITION.PFT(T2.TID) = p.partition_number ) AS SubTotals;   DROP TABLE #P;   SELECT DATEDIFF(Millisecond, @s, SYSUTCDATETIME()); SET STATISTICS IO OFF; Using the temporary table adds a few logical reads, but the overall execution time is still around 3500ms, indistinguishable from the same query without the temporary table. The problem is that the query optimizer still doesn’t choose a parallel plan for this query, though the removal of the system table reference means that it could if it chose to: In fact the optimizer did enter the parallel plan phase of query optimization (running search 1 for a second time): Unfortunately, the parallel plan found seemed to be more expensive than the serial plan. This is a crazy result, caused by the optimizer’s cost model not reducing operator CPU costs on the inner side of a nested loops join. Don’t get me started on that, we’ll be here all night. In this plan, everything expensive happens on the inner side of a nested loops join. Without a CPU cost reduction to compensate for the added cost of exchange operators, candidate parallel plans always look more expensive to the optimizer than the equivalent serial plan. Parallel Collocated Merge Join We can produce the desired parallel plan using trace flag 8649 again: SELECT row_count = SUM(Subtotals.cnt) FROM #P AS p CROSS APPLY ( SELECT cnt = COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID WHERE $PARTITION.PFT(T1.TID) = p.partition_number AND $PARTITION.PFT(T2.TID) = p.partition_number ) AS SubTotals OPTION (QUERYTRACEON 8649); The actual execution plan is: One difference between this plan and the collocated hash join plan is that a Repartition Streams exchange operator is used instead of Distribute Streams. The effect is similar, though not quite identical. The Repartition uses round-robin partitioning, meaning the next partition id is pushed to the next thread in sequence. The Distribute Streams exchange seen earlier used Demand partitioning, meaning the next partition id is pulled across the exchange by the next thread that is ready for more work. There are subtle performance implications for each partitioning option, but going into that would again take us too far off the main point of this post. Performance The important thing is the performance of this parallel collocated merge join – just 1350ms on a typical run. The list below shows all the alternatives from this post (all timings include creation, population, and deletion of the temporary table where appropriate) from quickest to slowest: Collocated parallel merge join: 1350ms Parallel hash join: 2600ms Collocated serial merge join: 3500ms Serial merge join: 5000ms Parallel merge join: 8400ms Collated parallel hash join: 25,300ms (hash spill per partition) The parallel collocated merge join requires no memory grant (aside from a paltry 1.2MB used for exchange buffers). This plan uses 16 threads at DOP 8; but 8 of those are (rather pointlessly) allocated to the parallel scan of the temporary table. These are minor concerns, but it turns out there is a way to address them if it bothers you. Parallel Collocated Merge Join with Demand Partitioning This final tweak replaces the temporary table with a hard-coded list of partition ids (dynamic SQL could be used to generate this query from sys.partitions): SELECT row_count = SUM(Subtotals.cnt) FROM ( VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10), (11),(12),(13),(14),(15),(16),(17),(18),(19),(20), (21),(22),(23),(24),(25),(26),(27),(28),(29),(30), (31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41) ) AS P (partition_number) CROSS APPLY ( SELECT cnt = COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID WHERE $PARTITION.PFT(T1.TID) = p.partition_number AND $PARTITION.PFT(T2.TID) = p.partition_number ) AS SubTotals OPTION (QUERYTRACEON 8649); The actual execution plan is: The parallel collocated hash join plan is reproduced below for comparison: The manual rewrite has another advantage that has not been mentioned so far: the partial counts (per partition) can be computed earlier than the partial counts (per thread) in the optimizer’s collocated join plan. The earlier aggregation is performed by the extra Stream Aggregate under the nested loops join. The performance of the parallel collocated merge join is unchanged at around 1350ms. Final Words It is a shame that the current query optimizer does not consider a collocated merge join (Connect item closed as Won’t Fix). The example used in this post showed an improvement in execution time from 2600ms to 1350ms using a modestly-sized data set and limited parallelism. In addition, the memory requirement for the query was almost completely eliminated  – down from 569MB to 1.2MB. The problem with the parallel hash join selected by the optimizer is that it attempts to process the full data set all at once (albeit using eight threads). It requires a large memory grant to hold all 5 million rows from table T1 across the eight hash tables, and does not take advantage of the divide-and-conquer opportunity offered by the common partitioning. The great thing about the collocated join strategies is that each parallel thread works on a single partition from both tables, reading rows, performing the join, and computing a per-partition subtotal, before moving on to a new partition. From a thread’s point of view… If you have trouble visualizing what is happening from just looking at the parallel collocated merge join execution plan, let’s look at it again, but from the point of view of just one thread operating between the two Parallelism (exchange) operators. Our thread picks up a single partition id from the Distribute Streams exchange, and starts a merge join using ordered rows from partition 1 of table T1 and partition 1 of table T2. By definition, this is all happening on a single thread. As rows join, they are added to a (per-partition) count in the Stream Aggregate immediately above the Merge Join. Eventually, either T1 (partition 1) or T2 (partition 1) runs out of rows and the merge join stops. The per-partition count from the aggregate passes on through the Nested Loops join to another Stream Aggregate, which is maintaining a per-thread subtotal. Our same thread now picks up a new partition id from the exchange (say it gets id 9 this time). The count in the per-partition aggregate is reset to zero, and the processing of partition 9 of both tables proceeds just as it did for partition 1, and on the same thread. Each thread picks up a single partition id and processes all the data for that partition, completely independently from other threads working on other partitions. One thread might eventually process partitions (1, 9, 17, 25, 33, 41) while another is concurrently processing partitions (2, 10, 18, 26, 34) and so on for the other six threads at DOP 8. The point is that all 8 threads can execute independently and concurrently, continuing to process new partitions until the wider job (of which the thread has no knowledge!) is done. This divide-and-conquer technique can be much more efficient than simply splitting the entire workload across eight threads all at once. Related Reading Understanding and Using Parallelism in SQL Server Parallel Execution Plans Suck © 2013 Paul White – All Rights Reserved Twitter: @SQL_Kiwi

    Read the article

  • In SQL, if we rename INNER JOIN as INTERSECT JOIN, LEFT OUTER JOIN as LEFT UNION JOIN, and FULL OUTE

    - by Jian Lin
    In SQL, the name Join gives an idea of "merging" or a sense of "union", making something bigger. But in fact, as in the other post http://stackoverflow.com/questions/2706051/in-sql-a-join-is-actually-an-intersection-and-it-is-also-a-linkage-or-a-sidew it turns out that a Join (Inner Join) is actually an Intersection. So if we think of Join = Inner Join = Intersect Join Left Outer Join = Left Union Join Full Outer Join = Full Union Join = Union Join then we always get a feel of what's happening, and maybe never forget what they are easily. In a way, we can think of Intersect as "making it less", therefore it is excluding something. That's why the name "Join" won't go with the idea of "Intersect". But in fact, both Intersect and Union can be thought of as: Union: bringing something together and merge them unconditionally. Intersect: bringing something together and merge them based on some condition. so the "bringing something together" is probably what "Join" is all about. It is like, Intersection is a "half glass of water" -- we can thinking of it as "excluding something" or as "bringing something together and accepting the common ones". So if the word "Intersect Join" is used, maybe a clear picture is there, and "Union Join" can be a clear picture too. Maybe the word "Inner Join" and "Outer Join" is very clear when we use SQL a lot. Somehow, the word "Outer" tends to give a feeling that it is "outside" and excluding something rather than a "Union".

    Read the article

  • Constructing a hash table/hash function.

    - by nn
    Hi, I would like to construct a hash table that looks up keys in sequences (strings) of bytes ranging from 1 to 15 bytes. I would like to store an integer value, so I imagine an array for hashing would suffice. I'm having difficulty conceptualizing how to construct a hash function such that given the key would give an index into the array. Any assistance would be much appreiated. The maximum number of entries in the hash is: 4081*15 + 4081*14 + ... 4081 = 4081((15*(16))/2) = 489720. So for example: int table[489720]; int lookup(unsigned char *key) { int index = hash(key); return table[index]; } How can I compute hash(key). I'd preferably like to get a perfect hash function. Thanks.

    Read the article

  • Problem with hash function: hash(1) == hash(1.0)

    - by mtasic
    I have an instance of dict with ints, floats, strings as keys, but the problem is when there are a as int and b as float, and float(a) == b, then their hash values are the same, and thats what I do NOT want to get because I need unique hash vales for this cases in order to get corresponding values. Example: d = {1:'1', 1.0:'1.0', '1':1, '1.0':1.0} d[1] == '1.0' d[1.0] == '1.0' d['1'] == 1 d['1.0'] == 1.0 What I need is: d = {1:'1', 1.0:'1.0', '1':1, '1.0':1.0} d[1] == '1' d[1.0] == '1.0' d['1'] == 1 d['1.0'] == 1.0

    Read the article

  • If Inner Join can be thought of as Cross Join but filtering out the records satisfying the condition

    - by Jian Lin
    If an Inner Join can be thought of as a cross join and then getting the records that satisfy the condition, then a LEFT OUTER JOIN can be thought of as that, plus ONE record on the left table that doesn't satisfy the condition. In other words, it is not a cross join that "goes easy" on the left records (even when the condition is not satisfied), because then the left record can appear many times (as many as how many records there are in the right table). So the LEFT OUTER JOIN is the Cross JOIN with the records satisfying the condition, plus ONE record from the LEFT TABLE that doesn't satisfy the condition.

    Read the article

  • Thinking of an Inner Join as a Cross Join and then satisfying some condition(s)?

    - by Jian Lin
    It seems like the safest way to think of an Inner Join is to think of it as a Cross Join and then satisfying some condition(s)? Because the equi-join can be obvious, but the non-equi-join can be a bit confusing. But if we always use the Cross Join, and then filter out the ones satisfying the condition, then we get the resulting table. In other words, we can always analyze it by using the first record on the left table, and then go through every single records on the right, and then repeat that for 2nd record on the left, and for the 3rd, 4th, ... etc. So in our mind, we can analyze it using this way, and it is like O(n^2), although what happens in the DBMS maybe that it is a lot faster (when an index is present). Is there another good way to think of it besides this method?

    Read the article

  • It seems like the safest way to think of an Inner Join is to think of it as a Cross Join and then sa

    - by Jian Lin
    It seems like the safest way to think of an Inner Join is to think of it as a Cross Join and then satisfying some condition(s)? Because the equi-join can be obvious, but the non-equi-join can be a bit confusing. But if we always use the Cross Join, and then filter out the ones satisfying the condition, then we get the resulting table. In other words, we can always analyze it by using the first record on the left table, and then go through every single records on the right, and then repeat that for 2nd record on the left, and for the 3rd, 4th, ... etc. So in our mind, we can analyze it using this way, and it is like O(n^2), although what happens in the DBMS maybe that it is a lot faster (when an index is present). Is there another good way to think of it besides this method?

    Read the article

  • SQL – Difference Between INNER JOIN and JOIN

    - by Pinal Dave
    Here is the follow up question to my earlier question SQL – Difference between != and Operator <> used for NOT EQUAL TO Operation. There was a pretty good discussion about this subject earlier and lots of people participated with their opinion. Though the answer was very simple but the conversation was indeed delightful and was indeed very informative. In this blog post I have another following up question to all of you. What is the difference between INNER JOIN and JOIN? If you are working with database you will find developers use above both the kinds of the joins in their SQL Queries. Here is the quick example of the same. Query using INNER JOIN SELECT * FROM Table1 INNER JOIN  Table2 ON Table1.Col1 = Table2.Col1 Query using JOIN SELECT * FROM Table1 JOIN  Table2 ON Table1.Col1 = Table2.Col1 The question is what is the difference between above two syntax. Here is the answer – They are equal to each other. There is absolutely no difference between them. They are equal in performance as well as implementation. JOIN is actually shorter version of INNER JOIN. Personally I prefer to write INNER JOIN because it is much cleaner to read and it avoids any confusion if there is related to JOIN. For example if users had written INNER JOIN instead of JOIN there would have been no confusion in mind and hence there was no need to have original question. Here is the question back to you - Which one of the following syntax do you use when you are inner joining two tables – INNER JOIN or JOIN? and Why? Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: PostADay, SQL, SQL Authority, SQL Joins, SQL Query, SQL Server, SQL Tips and Tricks, T SQL, Technology

    Read the article

  • SQL Server 2005 RIGHT OUTER JOIN not working

    - by CheeseConQueso
    I'm looking up access logs for specific courses. I need to show all the courses even if they don't exist in the logs table. Hence the outer join.... but after trying (presumably) all of the variations of LEFT OUTER, RIGHT OUTER, INNER and placement of the tables within the SQL code, I couldn't get my result. Here's what I am running: SELECT (a.first_name+' '+a.last_name) instructor, c.course_id, COUNT(l.access_date) course_logins, a.logins system_logins, MAX(l.access_date) last_course_login, a.last_login last_system_login FROM lsn_logs l RIGHT OUTER JOIN courses c ON l.course_id = c.course_id, accounts a WHERE l.object_id = 'LOGIN' AND c.course_type = 'COURSE' AND c.course_id NOT LIKE '%TEST%' AND a.account_rights > 2 AND l.user_id = a.username AND ((a.first_name+' '+a.last_name) = c.instructor) GROUP BY c.course_id, a.first_name, a.last_name, a.last_login, a.logins, c.instructor ORDER BY a.last_name, a.first_name, c.course_id, course_logins DESC Is it something in the WHERE clause that's preventing me from getting course_id's that don't exist in lsn_logs? Is it the way I'm joining the tables? Again, in short, I want all course_id's regardless of their existence in lsn_logs.

    Read the article

  • Are hash collisions with different file sizes just as likely as same file size?

    - by rwmnau
    I'm hashing a large number of files, and to avoid hash collisions, I'm also storing a file's original size - that way, even if there's a hash collision, it's extremely unlikely that the file sizes will also be identical. Is this sound (a hash collision is equally likely to be of any size), or do I need another piece of information (if a collision is more likely to also be the same length as the original). Or, more generally: Is every file just as likely to produce a particular hash, regardless of original file size?

    Read the article

  • How to join data frames in R (inner, outer, left, right)?

    - by Dan Goldstein
    Given two data frames df1 = data.frame(CustomerId=c(1:6),Product=c(rep("Toaster",3),rep("Radio",3))) df2 = data.frame(CustomerId=c(2,4,6),State=c(rep("Alabama",2),rep("Ohio",1))) > df1 CustomerId Product 1 Toaster 2 Toaster 3 Toaster 4 Radio 5 Radio 6 Radio > df2 CustomerId State 2 Alabama 4 Alabama 6 Ohio How can I do database style, i.e., sql style, joins? That is, how do I get: An inner join of df1 and df1 An outer join of df1 and df2 A left outer join of df1 and df2 A right outer join of df1 and df2 P.S. IKT-JARQ (I Know This - Just Adding R Questions) Extra credit: How can I do a sql style select statement?

    Read the article

  • Merge Join component sorted outputs [SSIS]

    - by jamiet
    One question that I have been asked a few times of late in regard to performance tuning SSIS data flows is this: Why isn’t the Merge Join output sorted (i.e.IsSorted=True)? This is a fair question. After all both of the Merge Join inputs are sorted, hence why wouldn’t the output be sorted as well? Well here’s a little secret, the Merge Join output IS sorted! There’s a caveat though – it is only under certain circumstances and SSIS itself doesn’t do a good job of informing you of it. Let’s take a look at an example. Here we have a dataflow that consumes data from the [AdventureWorks2008].[Sales].[SalesOrderHeader] & [AdventureWorks2008].[Sales].[SalesOrderDetail] tables then joins them using a Merge Join component: Let’s take a look inside the editor of the Merge Join: We are joining on the [SalesOrderId] field (which is what the two inputs just happen to be sorted upon). We are also putting [SalesOrderHeader].[SalesOrderId] into the output. Believe it or not the output from this Merge Join component is sorted (i.e. has IsSorted=True) but unfortunately the Merge Join component does not have an Advanced Editor hence it is hidden away from us. There are a couple of ways to prove to you that is the case; I could open up the package XML inside the .dtsx file and show you the metadata but there is an easier way than that – I can attach a Sort component to the output. Take a look: Notice that the Sort component is attempting to sort on the [SalesOrderId] column. This gives us the following warning: Validation warning. DFT Get raw data: {992B7C9A-35AD-47B9-A0B0-637F7DDF93EB}: The data is already sorted as specified so the transform can be removed. The warning proves that the output from the Merge Join is sorted! It must be noted that the Merge Join output will only have IsSorted=True if at least one of the join columns is included in the output. So there you go, the Merge Join component can indeed produce a sorted output and that’s very useful in order to avoid unnecessary expensive Sort operations downstream. Hope this is useful to someone out there! @Jamiet  P.S. Thank you to Bob Bojanic on the SSIS product team who pointed this out to me!

    Read the article

  • Merge Join component sorted outputs [SSIS]

    - by jamiet
    One question that I have been asked a few times of late in regard to performance tuning SSIS data flows is this: Why isn’t the Merge Join output sorted (i.e.IsSorted=True)? This is a fair question. After all both of the Merge Join inputs are sorted, hence why wouldn’t the output be sorted as well? Well here’s a little secret, the Merge Join output IS sorted! There’s a caveat though – it is only under certain circumstances and SSIS itself doesn’t do a good job of informing you of it. Let’s take a look at an example. Here we have a dataflow that consumes data from the [AdventureWorks2008].[Sales].[SalesOrderHeader] & [AdventureWorks2008].[Sales].[SalesOrderDetail] tables then joins them using a Merge Join component: Let’s take a look inside the editor of the Merge Join: We are joining on the [SalesOrderId] field (which is what the two inputs just happen to be sorted upon). We are also putting [SalesOrderHeader].[SalesOrderId] into the output. Believe it or not the output from this Merge Join component is sorted (i.e. has IsSorted=True) but unfortunately the Merge Join component does not have an Advanced Editor hence it is hidden away from us. There are a couple of ways to prove to you that is the case; I could open up the package XML inside the .dtsx file and show you the metadata but there is an easier way than that – I can attach a Sort component to the output. Take a look: Notice that the Sort component is attempting to sort on the [SalesOrderId] column. This gives us the following warning: Validation warning. DFT Get raw data: {992B7C9A-35AD-47B9-A0B0-637F7DDF93EB}: The data is already sorted as specified so the transform can be removed. The warning proves that the output from the Merge Join is sorted! It must be noted that the Merge Join output will only have IsSorted=True if at least one of the join columns is included in the output. So there you go, the Merge Join component can indeed produce a sorted output and that’s very useful in order to avoid unnecessary expensive Sort operations downstream. Hope this is useful to someone out there! @Jamiet  P.S. Thank you to Bob Bojanic on the SSIS product team who pointed this out to me!

    Read the article

  • Basics of Join Predicate Pushdown in Oracle

    - by Maria Colgan
    Happy New Year to all of our readers! We hope you all had a great holiday season. We start the new year by continuing our series on Optimizer transformations. This time it is the turn of Predicate Pushdown. I would like to thank Rafi Ahmed for the content of this blog.Normally, a view cannot be joined with an index-based nested loop (i.e., index access) join, since a view, in contrast with a base table, does not have an index defined on it. A view can only be joined with other tables using three methods: hash, nested loop, and sort-merge joins. Introduction The join predicate pushdown (JPPD) transformation allows a view to be joined with index-based nested-loop join method, which may provide a more optimal alternative. In the join predicate pushdown transformation, the view remains a separate query block, but it contains the join predicate, which is pushed down from its containing query block into the view. The view thus becomes correlated and must be evaluated for each row of the outer query block. These pushed-down join predicates, once inside the view, open up new index access paths on the base tables inside the view; this allows the view to be joined with index-based nested-loop join method, thereby enabling the optimizer to select an efficient execution plan. The join predicate pushdown transformation is not always optimal. The join predicate pushed-down view becomes correlated and it must be evaluated for each outer row; if there is a large number of outer rows, the cost of evaluating the view multiple times may make the nested-loop join suboptimal, and therefore joining the view with hash or sort-merge join method may be more efficient. The decision whether to push down join predicates into a view is determined by evaluating the costs of the outer query with and without the join predicate pushdown transformation under Oracle's cost-based query transformation framework. The join predicate pushdown transformation applies to both non-mergeable views and mergeable views and to pre-defined and inline views as well as to views generated internally by the optimizer during various transformations. The following shows the types of views on which join predicate pushdown is currently supported. UNION ALL/UNION view Outer-joined view Anti-joined view Semi-joined view DISTINCT view GROUP-BY view Examples Consider query A, which has an outer-joined view V. The view cannot be merged, as it contains two tables, and the join between these two tables must be performed before the join between the view and the outer table T4. A: SELECT T4.unique1, V.unique3 FROM T_4K T4,            (SELECT T10.unique3, T10.hundred, T10.ten             FROM T_5K T5, T_10K T10             WHERE T5.unique3 = T10.unique3) VWHERE T4.unique3 = V.hundred(+) AND       T4.ten = V.ten(+) AND       T4.thousand = 5; The following shows the non-default plan for query A generated by disabling join predicate pushdown. When query A undergoes join predicate pushdown, it yields query B. Note that query B is expressed in a non-standard SQL and shows an internal representation of the query. B: SELECT T4.unique1, V.unique3 FROM T_4K T4,           (SELECT T10.unique3, T10.hundred, T10.ten             FROM T_5K T5, T_10K T10             WHERE T5.unique3 = T10.unique3             AND T4.unique3 = V.hundred(+)             AND T4.ten = V.ten(+)) V WHERE T4.thousand = 5; The execution plan for query B is shown below. In the execution plan BX, note the keyword 'VIEW PUSHED PREDICATE' indicates that the view has undergone the join predicate pushdown transformation. The join predicates (shown here in red) have been moved into the view V; these join predicates open up index access paths thereby enabling index-based nested-loop join of the view. With join predicate pushdown, the cost of query A has come down from 62 to 32.  As mentioned earlier, the join predicate pushdown transformation is cost-based, and a join predicate pushed-down plan is selected only when it reduces the overall cost. Consider another example of a query C, which contains a view with the UNION ALL set operator.C: SELECT R.unique1, V.unique3 FROM T_5K R,            (SELECT T1.unique3, T2.unique1+T1.unique1             FROM T_5K T1, T_10K T2             WHERE T1.unique1 = T2.unique1             UNION ALL             SELECT T1.unique3, T2.unique2             FROM G_4K T1, T_10K T2             WHERE T1.unique1 = T2.unique1) V WHERE R.unique3 = V.unique3 and R.thousand < 1; The execution plan of query C is shown below. In the above, 'VIEW UNION ALL PUSHED PREDICATE' indicates that the UNION ALL view has undergone the join predicate pushdown transformation. As can be seen, here the join predicate has been replicated and pushed inside every branch of the UNION ALL view. The join predicates (shown here in red) open up index access paths thereby enabling index-based nested loop join of the view. Consider query D as an example of join predicate pushdown into a distinct view. We have the following cardinalities of the tables involved in query D: Sales (1,016,271), Customers (50,000), and Costs (787,766).  D: SELECT C.cust_last_name, C.cust_city FROM customers C,            (SELECT DISTINCT S.cust_id             FROM sales S, costs CT             WHERE S.prod_id = CT.prod_id and CT.unit_price > 70) V WHERE C.cust_state_province = 'CA' and C.cust_id = V.cust_id; The execution plan of query D is shown below. As shown in XD, when query D undergoes join predicate pushdown transformation, the expensive DISTINCT operator is removed and the join is converted into a semi-join; this is possible, since all the SELECT list items of the view participate in an equi-join with the outer tables. Under similar conditions, when a group-by view undergoes join predicate pushdown transformation, the expensive group-by operator can also be removed. With the join predicate pushdown transformation, the elapsed time of query D came down from 63 seconds to 5 seconds. Since distinct and group-by views are mergeable views, the cost-based transformation framework also compares the cost of merging the view with that of join predicate pushdown in selecting the most optimal execution plan. Summary We have tried to illustrate the basic ideas behind join predicate pushdown on different types of views by showing example queries that are quite simple. Oracle can handle far more complex queries and other types of views not shown here in the examples. Again many thanks to Rafi Ahmed for the content of this blog post.

    Read the article

  • SQL SERVER- Differences Between Left Join and Left Outer Join

    - by pinaldave
    There are a few questions that I had decided not to discuss on this blog because I think they are very simple and many of us know it. Many times, I even receive not-so positive notes from several readers when I am writing something simple. However, assuming that we know all and beginners should know everything is not the right attitude. Since day 1, I have been keeping a small journal regarding questions that I receive in this blog. There are around 200+ questions I receive every day through emails, comments and occasional phone calls. Yesterday, I received a comment with the following question: What are the differences between Left Join and Left Outer Join? Click here to read original comment. This question has triggered the threshold of receiving the same question repeatedly. Here is the answer: There is absolutely no difference between LEFT JOIN and LEFT OUTER JOIN. The same is true for RIGHT JOIN and RIGHT OUTER JOIN. When you use LEFT JOIN keyword in SQL Server, it means LEFT OUTER JOIN only. I have already written in-depth visual diagram discussing the JOINs. I encourage all of you to read the article for further understanding of the JOINs: Read Introduction to JOINs – Basic of JOINs Reference: Pinal Dave (http://blog.SQLAuthority.com) Filed under: Pinal Dave, PostADay, SQL, SQL Authority, SQL Joins, SQL Query, SQL Scripts, SQL Server, SQL Tips and Tricks, T SQL, Technology

    Read the article

  • Joining on a common table, how do you get a FULL OUTER JOIN to expand on another table?

    - by stimpy77
    I've scoured StackOverflow and Google for an answer to this problem. I'm trying to create a Microsot SQL Server 2008 view. Not a stored procedure. Not a function. Just a query (i.e. a view). I have three tables. The first table defines a common key, let's say "CompanyID". The other two tables have a sometimes-common field, let's say "EmployeeName". I want a single table result that, when my WHERE clause says "WHERE CompanyID = 12" looks like this: CompanyID | TableA | TableB 12 | John Doe | John Doe 12 | Betty Sue | NULL 12 | NULL | Billy Bob I've tried a FULL OUTER JOIN that looks like this: SELECT Company.CompanyID, TableA.EmployeeName, TableB.EmployeeName FROM Company FULL OUTER JOIN TableA ON Company.CompanyID = TableA.CompanyID FULL OUTER JOIN TableB ON Company.CompanyID = TableB.CompanyID AND (TableA.EmployeeName IS NULL OR TableB.EmployeeName IS NULL OR TableB.EmployeeName = TableA.EmployeeName) I'm only getting the NULL from one matched table, I'm not getting the expansion for the other table. In the above sample, I'm basically only getting the first and third rows and not the second. Can someone help me create this query and show me how this is done correctly? BTW I already have a stored procedure that looks very clean and populates an in-memory table, but that isn't what I want. Thanks.

    Read the article

  • Refactoring this code that produces a reverse-lookup hash from another hash

    - by Frank Joseph Mattia
    This code is based on the idea of a Form Object http://blog.codeclimate.com/blog/2012/10/17/7-ways-to-decompose-fat-activerecord-models/ (see #3 if unfamiliar with the concept). My actual code in question may be found here: https://gist.github.com/frankjmattia/82a9945f30bde29eba88 The code takes a hash of objects/attributes and creates a reverse lookup hash to keep track of their delegations to do this. delegate :first_name, :email, to: :user, prefix: true But I am manually creating the delegations from a hash like this: DELEGATIONS = { user: [ :first_name, :email ] } At runtime when I want to look up the translated attribute names for the objects, all I have to go on are the delegated/prefixed (have to use a prefix to avoid naming collisions) attribute names like :user_first_name which aren't in sync with the rails i18n way of doing it: en: activerecord: attributes: user: email: 'Email Address' The code I have take the above delegations hash and turns it into a lookup table so when I override human_attribute_name I can get back the original attribute name and its class. Then I send #human_attribute_name to the original class with the original attribute name as its argument. The code I've come up with works but it is ugly to say the least. I've never really used #inject so this was a crash course for me and am quite unsure if this code effective way of solving my problem. Could someone recommend a simpler solution that does not require a reverse lookup table or does that seem like the right way to go? Thanks, - FJM

    Read the article

  • Are has collisions with different file sizes just as likely as same file size?

    - by rwmnau
    I'm hashing a large number of files, and to avoid hash collisions, I'm also storing a file's original size - that way, even if there's a hash collision, it's extrememly unlikely that the file sizes will also be identical. Is this sound (a hash collision is equally likely to be of any size), or do I need another piece of information (if a collision is more likely to also be the same length as the original). Or, more generally: Is every file just as likely to produce a particular hash, regardless of original file size?

    Read the article

  • Most optimal order (of joins) for left join

    - by Ram
    I have 3 tables Table1 (with 1020690 records), Table2(with 289425 records), Table 3(with 83692 records).I have something like this SELECT * FROM Table1 T1 /* OK fine select * is bad when not all columns are needed, this is just an example*/ LEFT JOIN Table2 T2 ON T1.id=T2.id LEFT JOIN Table3 T3 ON T1.id=T3.id and a query like this SELECT * FROM Table1 T1 LEFT JOIN Table3 T3 ON T1.id=T3.id LEFT JOIN Table2 T2 ON T1.id=T2.id The query plan shows me that it uses 2 Merge Join for both the joins. For the first query, the first merge is with T1 and T2 and then with T3. For the second query, the first merge is with T1 and T3 and then with T2. Both these queries take about the same time(40 seconds approx.) or sometimes Query1 takes couple of seconds longer. So my question is, does the join order matter ?

    Read the article

  • LINQ to SQL - Left Outer Join with multiple join conditions

    - by dan
    I have the following SQL which I am trying to translate to LINQ: SELECT f.value FROM period as p LEFT OUTER JOIN facts AS f ON p.id = f.periodid AND f.otherid = 17 WHERE p.companyid = 100 I have seen the typical implementation of the left outer join (ie. into x from y in x.DefaultIfEmpty() etc.) but am unsure how to introduce the other join condition ('AND f.otherid = 17') EDIT Why is the 'AND f.otherid = 17' condition part of the JOIN instead of in the WHERE clause? Because f may not exist for some rows and I still want these rows to be included. If the condition is applied in the WHERE clause, after the JOIN - then I don't get the behaviour I want. Unfortunately this: from p in context.Periods join f in context.Facts on p.id equals f.periodid into fg from fgi in fg.DefaultIfEmpty() where p.companyid == 100 && fgi.otherid == 17 select f.value seems to be equivalent to this: SELECT f.value FROM period as p LEFT OUTER JOIN facts AS f ON p.id = f.periodid WHERE p.companyid = 100 && AND f.otherid = 17 which is not quite what I'm after.

    Read the article

1 2 3 4 5 6 7 8 9 10 11 12  | Next Page >