How can I optimize retrieving lowest edit distance from a large table in SQL?
Posted by Matt on Stack Overflow
Published on 2010-05-21T21:36:04Z
Hey,
I'm having trouble optimizing this Levenshtein distance calculation. I need to do the following:
- Get the record with the minimum distance for the source string as well as a trimmed version of the source string
- Pick the record with the minimum distance
- If the min distances are equal (original vs trimmed), choose the trimmed one with the lowest distance
- If there are still multiple records that fall under the above two categories, pick the one with the highest frequency
Here's my working version:
DECLARE @Results TABLE
(
    ID int,
    [Name] nvarchar(200),
    Distance int,
    Frequency int,
    Trimmed bit
)
INSERT INTO @Results
    SELECT ID,
           [Name],
           (dbo.Levenshtein(@Source, [Name])) As Distance,
           Frequency,
           'False' As Trimmed
    FROM
           MyTable
INSERT INTO @Results
    SELECT ID,
           [Name],
           (dbo.Levenshtein(@SourceTrimmed, [Name])) As Distance,
           Frequency,
           'True' As Trimmed
    FROM
           MyTable
SET @ResultID = (SELECT TOP 1 ID FROM @Results ORDER BY Distance, Trimmed, Frequency)
SET @Result = (SELECT TOP 1 [Name] FROM @Results ORDER BY Distance, Trimmed, Frequency)
SET @ResultDist = (SELECT TOP 1 Distance FROM @Results ORDER BY Distance, Trimmed, Frequency)
SET @ResultTrimmed = (SELECT TOP 1 Trimmed FROM @Results ORDER BY Distance, Trimmed, Frequency)
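As a side note, the four TOP 1 lookups above can probably be collapsed into one statement, since a single SELECT can assign several variables from the same row. Below is a minimal sketch of that idea. The sample rows are purely illustrative, the variable declarations are assumed (the post does not show them), and the sort directions follow the tie-break rules listed at the top (trimmed first on a tie, then highest frequency), so adjust them if your version should sort differently.

```sql
-- Assumed declarations; the post does not show how these are declared.
DECLARE @ResultID int, @Result nvarchar(200), @ResultDist int, @ResultTrimmed bit;

DECLARE @Results TABLE
(
    ID int,
    [Name] nvarchar(200),
    Distance int,
    Frequency int,
    Trimmed bit
);

-- Illustrative rows only; in the post these come from the two INSERT ... SELECT statements.
INSERT INTO @Results VALUES (1, N'apple', 2, 10, 0);
INSERT INTO @Results VALUES (1, N'apple', 2, 10, 1);
INSERT INTO @Results VALUES (2, N'ample', 2, 50, 1);

-- One pass over @Results: every variable is taken from the same winning row,
-- so the four values cannot end up coming from different rows on a tie.
SELECT TOP 1
       @ResultID      = ID,
       @Result        = [Name],
       @ResultDist    = Distance,
       @ResultTrimmed = Trimmed
FROM   @Results
ORDER BY Distance ASC,    -- smallest edit distance first
         Trimmed DESC,    -- on a tie, prefer the trimmed match (bit 1 sorts first)
         Frequency DESC;  -- then the most frequent record

-- With the sample rows above this picks ID 2 (trimmed, frequency 50).
SELECT @ResultID AS ResultID, @Result AS Result,
       @ResultDist AS ResultDist, @ResultTrimmed AS ResultTrimmed;
```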
I believe what I need to do here is:
- Not dump the results into a temporary table
- Do only 1 select from `MyTable`
- Set the results directly in the initial select statement (since a select can assign variables, and you can assign multiple variables in one select statement)
I know there has to be a good implementation of this but I can't figure it out... this is as far as I got:
SELECT top 1 @ResultID = ID,
@Result = [Name],
(dbo.Levenshtein(@Source, [Name])) As distOrig,
(dbo.Levenshtein(@SourceTrimmed, [Name])) As distTrimmed,
Frequency
FROM
MyTable
WHERE /* ... yeah I'm lost */
ORDER BY distOrig, distTrimmed, Frequency
Any ideas?
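One possible direction, offered as a sketch rather than a definitive answer: SQL Server does not allow a SELECT to assign variables and return columns at the same time, which is one reason the attempt above will not compile. A way around that is to compute both distances per row with CROSS APPLY (available from SQL Server 2005) and keep only variable assignments in the select list. The sketch assumes `dbo.Levenshtein` takes two strings and returns an int, as in the working version. The variable declarations, the sample value of `@Source`, and the use of LTRIM/RTRIM to build `@SourceTrimmed` are assumptions, since the post does not show them. The sort directions follow the tie-break rules listed at the top.

```sql
-- Assumed declarations and sample input; not shown in the post.
DECLARE @Source nvarchar(200), @SourceTrimmed nvarchar(200);
DECLARE @ResultID int, @Result nvarchar(200), @ResultDist int, @ResultTrimmed bit;

SET @Source = N'  example  ';
-- Assumption: the post does not say how the trimmed string is produced.
SET @SourceTrimmed = LTRIM(RTRIM(@Source));

-- Single scan of MyTable, no table variable.
-- CROSS APPLY yields two candidate rows per record: one for the original
-- source string and one for the trimmed one.
SELECT TOP 1
       @ResultID      = t.ID,
       @Result        = t.[Name],
       @ResultDist    = d.Distance,
       @ResultTrimmed = d.Trimmed
FROM   MyTable AS t
CROSS APPLY
(
    SELECT dbo.Levenshtein(@Source, t.[Name]) AS Distance,
           CAST(0 AS bit) AS Trimmed
    UNION ALL
    SELECT dbo.Levenshtein(@SourceTrimmed, t.[Name]),
           CAST(1 AS bit)
) AS d
ORDER BY d.Distance ASC,    -- smallest edit distance first
         d.Trimmed DESC,    -- on a tie, prefer the trimmed match
         t.Frequency DESC;  -- then the most frequent record
```

This still calls `dbo.Levenshtein` twice per row, so the likely saving is the second table scan and the table variable, not the distance computation itself. If the scalar function dominates the run time, measuring before and after would show whether the rewrite helps at all.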
© Stack Overflow or respective owner