I came across some unexpected behavior while troubleshooting a failing test the other day that took me long enough to figure out that I thought it was worth sharing here. I finally traced the failing test back to a SELECT statement in a stored procedure that was using the IN t-sql operator to exclude a certain set of values. Here’s a very simple example table to illustrate the issue: Customers CustomerId INT, NOT NULL, Primary Key CustomerName nvarchar(100) NOT NULL SalesRegionId INT NULL The ‘SalesRegionId’ column contains a number representing the sales region that the customer belongs to. This column is nullable because new customers get created all the time but assigning them to sales regions is a process that is handled by a regional manager on a periodic basis. For the purposes of this example, the Customers table currently has the following rows: CustomerId CustomerName SalesRegionId 1 Customer A 1 2 Customer B NULL 3 Customer C 4 4 Customer D 2 5 Customer E 3 How could we write a query against this table for all customers that are NOT in sales regions 2 or 4? You might try something like this: 1: SELECT
2: CustomerId,
3: CustomerName,
4: SalesRegionId
5: FROM Customers
6: WHERE SalesRegionId NOT IN (2,4)
Will this work? In short, no; at least not in the way that you might expect. Here’s what this query will return given the example data we’re working with:
CustomerId
CustomerName
SalesRegionId
1
Customer A
1
5
Customer E
5
I was expecting that this query would also return ‘Customer B’, since that customer has a NULL SalesRegionId. In my mind, having a customer with no sales region should be included in a set of customers that are not in sales regions 2 or 4.When I first started troubleshooting my issue I made note of the fact that this query should probably be re-written without the NOT IN clause, but I didn’t suspect that the NOT IN clause was actually the source of the issue. This particular query was only one minor piece in a much larger process that was being exercised via an automated integration test and I simply made a poor assumption that the NOT IN would work the way that I thought it should.
So why doesn’t this work the way that I thought it should? From the MSDN documentation on the t-sql IN operator:
If the value of test_expression is equal to any value returned by subquery or is equal to any expression from the comma-separated list, the result value is TRUE; otherwise, the result value is FALSE.
Using NOT IN negates the subquery value or expression.
The key phrase out of that quote is, “… is equal to any expression from the comma-separated list…”. The NULL SalesRegionId isn’t included in the NOT IN because of how NULL values are handled in equality comparisons. From the MSDN documentation on ANSI_NULLS:
The SQL-92 standard requires that an equals (=) or not equal to (<>) comparison against a null value evaluates to FALSE. When SET ANSI_NULLS is ON, a SELECT statement using WHERE column_name = NULL returns zero rows even if there are null values in column_name. A SELECT statement using WHERE column_name <> NULL returns zero rows even if there are nonnull values in column_name.
In fact, the MSDN documentation on the IN operator includes the following blurb about using NULL values in IN sub-queries or expressions that are used with the IN operator:
Any null values returned by subquery or expression that are compared to test_expression using IN or NOT IN return UNKNOWN. Using null values in together with IN or NOT IN can produce unexpected results.
If I were to include a ‘SET ANSI_NULLS OFF’ command right above my SELECT statement I would get ‘Customer B’ returned in the results, but that’s definitely not the right way to deal with this. We could re-write the query to explicitly include the NULL value in the WHERE clause:
1: SELECT
2: CustomerId,
3: CustomerName,
4: SalesRegionId
5: FROM Customers
6: WHERE (SalesRegionId NOT IN (2,4) OR SalesRegionId IS NULL)
This query works and properly includes ‘Customer B’ in the results, but I ultimately opted to re-write the query using a LEFT OUTER JOIN against a table variable containing all of the values that I wanted to exclude because, in my case, there could potentially be several hundred values to be excluded. If we were to apply the same refactoring to our simple sales region example we’d end up with:
1: DECLARE @regionsToIgnore TABLE (IgnoredRegionId INT)
2: INSERT @regionsToIgnore values (2),(4)
3:
4: SELECT
5: c.CustomerId,
6: c.CustomerName,
7: c.SalesRegionId
8: FROM Customers c
9: LEFT OUTER JOIN @regionsToIgnore r ON r.IgnoredRegionId = c.SalesRegionId
10: WHERE r.IgnoredRegionId IS NULL
By performing a LEFT OUTER JOIN from Customers to the @regionsToIgnore table variable we can simply exclude any rows where the IgnoredRegionId is null, as those represent customers that DO NOT appear in the ignored regions list. This approach will likely perform better if the number of sales regions to ignore gets very large and it also will correctly include any customers that do not yet have a sales region.