When to use CTEs to encapsulate sub-results, and when to let the RDBMS worry about massive joins.
- by IanC
This is a SQL theory question. I can provide an example, but I don't think it's needed to make my point. Anyone experienced with SQL will immediately know what I'm talking about.
Usually we use joins to minimize the number of records due to matching the left and right rows. However, under certain conditions, joining tables cause a multiplication of results where the result is all permutations of the left and right records.
I have a database which has 3 or 4 such joins. This turns what would be a few records into a multitude. My concern is that the tables will be large in production, so the number of these joined rows will be immense. Further, heavy math is performed on each row, and the idea of performing math on duplicate rows is enough to make anyone shudder.
I have two questions. The first is, is this something I should care about, or will SQL Server intelligently realize these rows are all duplicates and optimize all processing accordingly?
The second is, is there any advantage to grouping each part of the query so as to get only the distinct values going into the next part of the query, using something like:
WITH t1 AS (
SELECT DISTINCT... [or GROUP BY]
),
t2 AS (
SELECT DISTINCT...
),
t3 AS (
SELECT DISTINCT...
)
SELECT...
I have often seen the use of DISTINCT applied to subqueries. There is obviously a reason for doing this. However, I'm talking about something a little different and perhaps more subtle and tricky.