Joining on NULLs
- by Dave Ballantyne
A problem I see on a fairly regular basis is that of dealing with NULL values. Specifically here, where we are joining two tables on two columns, one of which is ‘optional’ ie is nullable. So something like this: i.e. Lookup where all the columns are equal, even when NULL. NULL’s are a tricky thing to initially wrap your mind around. Statements like “NULL is not equal to NULL and neither is it not not equal to NULL, it’s NULL” can cause a serious brain freeze and leave you a gibbering wreck and needing your mummy. Before we plod on, time to setup some data to demo against. Create table #SourceTable
(
Id integer not null,
SubId integer null,
AnotherCol char(255) not null
)
go
create unique clustered index idxSourceTable on #SourceTable(id,subID)
go
with cteNums
as
(
select top(1000) number
from master..spt_values
where type ='P'
)
insert into #SourceTable
select Num1.number,nullif(Num2.number,0),'SomeJunk'
from cteNums num1
cross join cteNums num2
go
Create table #LookupTable
(
Id integer not null,
SubID integer null
)
go
insert into #LookupTable
Select top(100) id,subid
from #SourceTable
where subid is not null
order by newid()
go
insert into #LookupTable
Select top(3) id,subid
from #SourceTable
where subid is null
order by newid()
If that has run correctly, you will have 1 million rows in #SourceTable and 103 rows in #LookupTable. We now want to join one to the other.
First attempt – Lets just join
select *
from #SourceTable join #LookupTable
on #LookupTable.id = #SourceTable.id
and #LookupTable.SubID = #SourceTable.SubID
OK, that’s a fail. We had 100 rows back, we didn’t correctly account for the 3 rows that have null values. Remember NULL <> NULL and the join clause specifies SUBID=SUBID, which for those rows is not true.
Second attempt – Lets deal with those pesky NULLS
select *
from #SourceTable join #LookupTable
on #LookupTable.id = #SourceTable.id
and isnull(#LookupTable.SubID,0) = isnull(#SourceTable.SubID,0)
OK, that’s the right result, well done and 99.9% of the time that is where its left. It is a relatively trivial CPU overhead to wrap ISNULL around both columns and compare that result, so no problems. But, although that’s true, this a relational database we are using here, not a procedural language. SQL is a declarative language, we are making a request to the engine to get the results we want. How we ask for them can make a ton of difference.
Lets look at the plan for our second attempt, specifically the clustered index seek on the #SourceTable
There are 2 predicates. The ‘seek predicate’ and ‘predicate’. The ‘seek predicate’ describes how SQLServer has been able to use an Index. Here, it has been able to navigate the index to resolve where ID=ID. So far so good, but what about the ‘predicate’ (aka residual probe) ? This is a row-by-row operation. For each row found in the index matching the Seek Predicate, the leaf level nodes have been scanned and tested using this logical condition. In this example [Expr1007] is the result of the IsNull operation on #LookupTable and that is tested for equality with the IsNull operation on #SourceTable. This residual probe is quite a high overhead, if we can express our statement slightly differently to take full advantage of the index and make the test part of the ‘Seek Predicate’.
Third attempt – X is null and Y is null
So, lets state the query in a slightly manner:
select *
from #SourceTable join #LookupTable
on #LookupTable.id = #SourceTable.id
and ( #LookupTable.SubID = #SourceTable.SubID or
(#LookupTable.SubID is null and #SourceTable.SubId is null)
)
So its slightly wordier and may not be as clear in its intent to the human reader, that is what comments are for, but the key point is that it is now clearer to the query optimizer what our intention is.
Let look at the plan for that query, again specifically the index seek operation on #SourceTable
No ‘predicate’, just a ‘Seek Predicate’ against the index to resolve both ID and SubID. A subtle difference that can be easily overlooked. But has it made a difference to the performance ? Well, yes , a perhaps surprisingly high one.
Clever query optimizer well done.
If you are using a scalar function on a column, you a pretty much guaranteeing that a residual probe will be used. By re-wording the query you may well be able to avoid this and use the index completely to resolve lookups. In-terms of performance and scalability your system will be in a much better position if you can.