SQL Spatial: Getting “nearest” calculations working properly

Posted by Rob Farley on SQL Blog See other posts from SQL Blog or by Rob Farley
Published on Thu, 14 Aug 2014 10:16:44 GMT Indexed on 2014/08/18 16:38 UTC
Read the original article Hit count: 634

Filed under:
|
|

If you’ve ever done spatial work with SQL Server, I hope you’ve come across the ‘nearest’ problem.

You have five thousand stores around the world, and you want to identify the one that’s closest to a particular place. Maybe you want the store closest to the LobsterPot office in Adelaide, at -34.925806, 138.605073. Or our new US office, at 42.524929, -87.858244. Or maybe both!

You know how to do this. You don’t want to use an aggregate MIN or MAX, because you want the whole row, telling you which store it is. You want to use TOP, and if you want to find the closest store for multiple locations, you use APPLY. Let’s do this (but I’m going to use addresses in AdventureWorks2012, as I don’t have a list of stores). Oh, and before I do, let’s make sure we have a spatial index in place. I’m going to use the default options.

CREATE SPATIAL INDEX spin_Address ON Person.Address(SpatialLocation);

And my actual query:

WITH MyLocations AS
(SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)),
                       ('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326))
               ) t (Name, Geo))
SELECT l.Name, a.AddressLine1, a.City, s.Name AS [State], c.Name AS Country
FROM MyLocations AS l
CROSS APPLY (
    SELECT TOP (1) *
    FROM Person.Address AS ad
    ORDER BY l.Geo.STDistance(ad.SpatialLocation)
    ) AS a
JOIN Person.StateProvince AS s
    ON s.StateProvinceID = a.StateProvinceID
JOIN Person.CountryRegion AS c
    ON c.CountryRegionCode = s.CountryRegionCode
;

image

Great! This is definitely working. I know both those City locations, even if the AddressLine1s don’t quite ring a bell. I’m sure I’ll be able to find them next time I’m in the area.

But of course what I’m concerned about from a querying perspective is what’s happened behind the scenes – the execution plan.

image

This isn’t pretty. It’s not using my index. It’s sucking every row out of the Address table TWICE (which sucks), and then it’s sorting them by the distance to find the smallest one. It’s not pretty, and it takes a while. Mind you, I do like the fact that it saw an indexed view it could use for the State and Country details – that’s pretty neat. But yeah – users of my nifty website aren’t going to like how long that query takes.

The frustrating thing is that I know that I can use the index to find locations that are within a particular distance of my locations quite easily, and Microsoft recommends this for solving the ‘nearest’ problem, as described at http://msdn.microsoft.com/en-au/library/ff929109.aspx.

Now, in the first example on this page, it says that the query there will use the spatial index. But when I run it on my machine, it does nothing of the sort.

image

I’m not particularly impressed. But what we see here is that parallelism has kicked in. In my scenario, it’s split the data up into 4 threads, but it’s still slow, and not using my index. It’s disappointing.

But I can persuade it with hints!

If I tell it to FORCESEEK, or use my index, or even turn off the parallelism with MAXDOP 1, then I get the index being used, and it’s a thing of beauty! Part of the plan is here:

image

It’s massive, and it’s ugly, and it uses a TVF… but it’s quick.

The way it works is to hook into the GeodeticTessellation function, which is essentially finds where the point is, and works out through the spatial index cells that surround it. This then provides a framework to be able to see into the spatial index for the items we want. You can read more about it at http://msdn.microsoft.com/en-us/library/bb895265.aspx#tessellation – including a bunch of pretty diagrams. One of those times when we have a much more complex-looking plan, but just because of the good that’s going on.

This tessellation stuff was introduced in SQL Server 2012. But my query isn’t using it.

When I try to use the FORCESEEK hint on the Person.Address table, I get the friendly error:

Msg 8622, Level 16, State 1, Line 1
Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.

And I’m almost tempted to just give up and move back to the old method of checking increasingly large circles around my location. After all, I can even leverage multiple OUTER APPLY clauses just like I did in my recent Lookup post.

WITH MyLocations AS
(SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)),
                       ('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326))
               ) t (Name, Geo))
SELECT
    l.Name,
    COALESCE(a1.AddressLine1,a2.AddressLine1,a3.AddressLine1),
    COALESCE(a1.City,a2.City,a3.City),
    s.Name AS [State],
    c.Name AS Country
FROM MyLocations AS l
OUTER APPLY (
    SELECT TOP (1) *
    FROM Person.Address AS ad
    WHERE l.Geo.STDistance(ad.SpatialLocation) < 1000
    ORDER BY l.Geo.STDistance(ad.SpatialLocation)
    ) AS a1
OUTER APPLY (
    SELECT TOP (1) *
    FROM Person.Address AS ad
    WHERE l.Geo.STDistance(ad.SpatialLocation) < 5000
    AND a1.AddressID IS NULL
    ORDER BY l.Geo.STDistance(ad.SpatialLocation)
    ) AS a2
OUTER APPLY (
    SELECT TOP (1) *
    FROM Person.Address AS ad
    WHERE l.Geo.STDistance(ad.SpatialLocation) < 20000
    AND a2.AddressID IS NULL
    ORDER BY l.Geo.STDistance(ad.SpatialLocation)
    ) AS a3
JOIN Person.StateProvince AS s
    ON s.StateProvinceID = COALESCE(a1.StateProvinceID,a2.StateProvinceID,a3.StateProvinceID)
JOIN Person.CountryRegion AS c
    ON c.CountryRegionCode = s.CountryRegionCode
;

But this isn’t friendly-looking at all, and I’d use the method recommended by Isaac Kunen, who uses a table of numbers for the expanding circles.

It feels old-school though, when I’m dealing with SQL 2012 (and later) versions. So why isn’t my query doing what it’s supposed to? Remember the query...

WITH MyLocations AS
(SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)),
                       ('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326))
               ) t (Name, Geo))
SELECT l.Name, a.AddressLine1, a.City, s.Name AS [State], c.Name AS Country
FROM MyLocations AS l
CROSS APPLY (
    SELECT TOP (1) *
    FROM Person.Address AS ad
    ORDER BY l.Geo.STDistance(ad.SpatialLocation)
    ) AS a
JOIN Person.StateProvince AS s
    ON s.StateProvinceID = a.StateProvinceID
JOIN Person.CountryRegion AS c
    ON c.CountryRegionCode = s.CountryRegionCode
;

Well, I just wasn’t reading http://msdn.microsoft.com/en-us/library/ff929109.aspx properly.

The following requirements must be met for a Nearest Neighbor query to use a spatial index:

  1. A spatial index must be present on one of the spatial columns and the STDistance() method must use that column in the WHERE and ORDER BY clauses.

  2. The TOP clause cannot contain a PERCENT statement.

  3. The WHERE clause must contain a STDistance() method.

  4. If there are multiple predicates in the WHERE clause then the predicate containing STDistance() method must be connected by an AND conjunction to the other predicates. The STDistance() method cannot be in an optional part of the WHERE clause.

  5. The first expression in the ORDER BY clause must use the STDistance() method.

  6. Sort order for the first STDistance() expression in the ORDER BY clause must be ASC.

  7. All the rows for which STDistance returns NULL must be filtered out.

Let’s start from the top.

1. Needs a spatial index on one of the columns that’s in the STDistance call. Yup, got the index.

2. No ‘PERCENT’. Yeah, I don’t have that.

3. The WHERE clause needs to use STDistance(). Ok, but I’m not filtering, so that should be fine.

4. Yeah, I don’t have multiple predicates.

5. The first expression in the ORDER BY is my distance, that’s fine.

6. Sort order is ASC, because otherwise we’d be starting with the ones that are furthest away, and that’s tricky.

7. All the rows for which STDistance returns NULL must be filtered out. But I don’t have any NULL values, so that shouldn’t affect me either.

...but something’s wrong. I do actually need to satisfy #3. And I do need to make sure #7 is being handled properly, because there are some situations (eg, differing SRIDs) where STDistance can return NULL. It says so at http://msdn.microsoft.com/en-us/library/bb933808.aspx – “STDistance() always returns null if the spatial reference IDs (SRIDs) of the geography instances do not match.” So if I simply make sure that I’m filtering out the rows that return NULL…

…then it’s blindingly fast, I get the right results, and I’ve got the complex-but-brilliant plan that I wanted.

image

It just wasn’t overly intuitive, despite being documented.

@rob_farley

© SQL Blog or respective owner

Related posts about Performance

Related posts about spatial