There are interesting things to be learned from even
the simplest queries. For example, imagine you are given
the task of writing a query to list AdventureWorks product names where
the product has at least one entry in
the transaction history table, but fewer than ten. One possible query to meet that specification is: SELECT
p.Name
FROM Production.Product AS p
JOIN Production.TransactionHistory AS th ON
p.ProductID = th.ProductID
GROUP BY
p.ProductID,
p.Name
HAVING
COUNT_BIG(*) < 10;
That query correctly
returns 23 rows (execution plan and data sample shown below):
The execution plan looks a bit different from
the written form of
the query:
the base tables are accessed in reverse order, and
the aggregation is performed before
the join.
The general idea is to read all rows from
the history table, compute
the count of rows grouped by ProductID, merge join
the results to
the Product table on ProductID, and finally filter to only return rows where
the count is less than ten.
This ‘fully-optimized’ plan has an estimated cost of around 0.33 units.
The reason for
the quote marks there is that this plan is not quite as optimal as it could be – surely it would make sense to push
the Filter down past
the join too? To answer that, let’s look at some other ways to formulate this query. This being SQL, there are any number of ways to write logically-equivalent query specifications, so we’ll just look at a couple of interesting ones.
The first query is an attempt to reverse-engineer T-SQL from
the optimized query plan shown above. It joins
the result of pre-aggregating
the history table to
the Product table before filtering:
SELECT p.Name
FROM
(
SELECT
th.ProductID,
cnt = COUNT_BIG(*)
FROM Production.TransactionHistory AS th
GROUP BY
th.ProductID
) AS q1
JOIN Production.Product AS p
ON p.ProductID = q1.ProductID
WHERE
q1.cnt < 10;
Perhaps a little surprisingly, we get a slightly different execution plan:
The results are
the same (23 rows) but this time
the Filter is pushed below
the join!
The optimizer chooses nested loops for
the join, because
the cardinality estimate for rows passing
the Filter is a bit low (estimate 1 versus 23 actual), though you can force a merge join with a hint and
the Filter still appears below
the join. In yet another variation,
the < 10 predicate can be ‘manually pushed’ by specifying it in a HAVING clause in
the “q1” sub-query instead of in
the WHERE clause as written above.
The reason this predicate can be pushed past
the join in this query form, but not in
the original formulation is simply an optimizer limitation – it does make efforts (primarily during
the simplification phase) to encourage logically-equivalent query specifications to produce
the same execution plan, but
the implementation is not completely comprehensive.
Moving on to a second example,
the following query specification results from phrasing
the requirement as “list
the products where there exists fewer than ten correlated rows in
the history table”:
SELECT p.Name
FROM Production.Product AS p
WHERE EXISTS
(
SELECT *
FROM Production.TransactionHistory AS th
WHERE th.ProductID = p.ProductID
HAVING COUNT_BIG(*) < 10
);
Unfortunately, this query produces an incorrect result (86 rows):
The problem is that it lists products with no history rows, though
the reasons are interesting.
The COUNT_BIG(*) in
the EXISTS clause is a scalar aggregate (meaning there is no GROUP BY clause) and scalar aggregates always produce a value, even when
the input is an empty set. In
the case of
the COUNT aggregate,
the result of aggregating
the empty set is zero (the other standard aggregates produce a NULL). To make
the point really clear, let’s look at product 709, which happens to be one for which no history rows exist:
-- Scalar aggregate
SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS th
WHERE th.ProductID = 709;
-- Vector aggregate
SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS th
WHERE th.ProductID = 709
GROUP BY th.ProductID;
The estimated execution plans for these two statements are almost identical:
You might expect
the Stream Aggregate to have a Group By for
the second statement, but this is not
the case.
The query includes an equality comparison to a constant value (709), so all qualified rows are guaranteed to have
the same value for ProductID and
the Group By is optimized away.
In fact there are some minor differences between
the two plans (the first is auto-parameterized and qualifies for trivial plan, whereas
the second is not auto-parameterized and requires cost-based optimization), but there is nothing to indicate that one is a scalar aggregate and
the other is a vector aggregate. This is something I would like to see exposed in show plan so I suggested it on Connect. Anyway,
the results of running
the two queries show
the difference at runtime:
The scalar aggregate (no GROUP BY)
returns a result of zero, whereas
the vector aggregate (with a GROUP BY clause)
returns nothing at all. Returning to our EXISTS query, we could ‘fix’ it by changing
the HAVING clause to reject rows where
the scalar aggregate
returns zero:
SELECT p.Name
FROM Production.Product AS p
WHERE EXISTS
(
SELECT *
FROM Production.TransactionHistory AS th
WHERE th.ProductID = p.ProductID
HAVING COUNT_BIG(*) BETWEEN 1 AND 9
);
The query now
returns the correct 23 rows:
Unfortunately,
the execution plan is less efficient now – it has an estimated cost of 0.78 compared to 0.33 for
the earlier plans. Let’s try adding a redundant GROUP BY instead of changing
the HAVING clause:
SELECT p.Name
FROM Production.Product AS p
WHERE EXISTS
(
SELECT *
FROM Production.TransactionHistory AS th
WHERE th.ProductID = p.ProductID
GROUP BY th.ProductID
HAVING COUNT_BIG(*) < 10
);
Not only do we now get correct results (23 rows), this is
the execution plan:
I like to compare that plan to quantum physics: if you don’t find it shocking, you haven’t understood it properly :)
The simple addition of a redundant GROUP BY has resulted in
the EXISTS form of
the query being transformed into exactly
the same optimal plan we found earlier. What’s more, in SQL Server 2008 and later, we can replace
the odd-looking GROUP BY with an explicit GROUP BY on
the empty set:
SELECT p.Name
FROM Production.Product AS p
WHERE EXISTS
(
SELECT *
FROM Production.TransactionHistory AS th
WHERE th.ProductID = p.ProductID
GROUP BY ()
HAVING COUNT_BIG(*) < 10
);
I offer that as an alternative because some people find it more intuitive (and it perhaps has more geek value too). Whichever way you prefer, it’s rather satisfying to note that
the result of
the sub-query does not exist for a particular correlated value where a vector aggregate is used (the scalar COUNT aggregate always
returns a value, even if zero, so it always ‘EXISTS’ regardless which ProductID is logically being evaluated).
The following query forms also produce
the optimal plan and correct results, so long as a vector aggregate is used (you can probably find more equivalent query forms):
WHERE Clause
SELECT
p.Name
FROM Production.Product AS p
WHERE
(
SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS th
WHERE th.ProductID = p.ProductID
GROUP BY ()
) < 10;
APPLY
SELECT p.Name
FROM Production.Product AS p
CROSS APPLY
(
SELECT NULL
FROM Production.TransactionHistory AS th
WHERE th.ProductID = p.ProductID
GROUP BY ()
HAVING COUNT_BIG(*) < 10
) AS ca (dummy);
FROM Clause
SELECT q1.Name
FROM
(
SELECT
p.Name,
cnt =
(
SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS th
WHERE th.ProductID = p.ProductID
GROUP BY ()
)
FROM Production.Product AS p
) AS q1
WHERE
q1.cnt < 10;
This last example uses SUM(1) instead of COUNT and does not require a vector aggregate…you should be able to work out why :)
SELECT q.Name
FROM
(
SELECT
p.Name,
cnt =
(
SELECT SUM(1)
FROM Production.TransactionHistory AS th
WHERE th.ProductID = p.ProductID
)
FROM Production.Product AS p
) AS q
WHERE q.cnt < 10;
The semantics of SQL aggregates are rather odd in places. It definitely pays to get to know
the rules, and to be careful to check whether your queries are using scalar or vector aggregates. As we have seen, query plans do not show in which ‘mode’ an aggregate is running and getting it wrong can cause poor performance, wrong results, or both.
© 2012 Paul White
Twitter: @SQL_Kiwi
email:
[email protected]