Please help fix and optimize this query

Posted by user607217 on Stack Overflow
Published on 2012-10-25T16:56:52Z

I am working on a system to find potential duplicates in our customers table (SQL Server 2005). I am using the built-in SOUNDEX value that our software computes when customers are added or updated, but I also implemented the double metaphone algorithm for better matching. This is the most deeply nested query I have written, and I can't help but think there is a better way to do it; I'd like to learn.

In the innermost query I join the customer table to the metaphone table I created, then find pairs of customers that have an identical pKey (primary phonetic key). I union that with pairs of customers that have matching soundex values, and then score those matches with various text similarity functions.

This is currently working, but I would also like to add a union of customers whose aKey (alternate phonetic key) matches. This would be identical to "QUERY A" except that the join condition would be on (c1Akey = c2Akey). However, when I include that branch, I get errors when I try to execute my query.
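A minimal sketch of the aKey branch sitting next to QUERY A, assuming a common table expression (available in SQL Server 2005) is acceptable here. The table variables, the sample rows, the placeholder key values, and the CTE name `custKeys` are all invented for illustration; the real query would read from Customers and ##Metaphone. Since the error message is not quoted, this only shows a shape that satisfies the UNION rules (same number of columns, same order, compatible types in every branch), not a confirmed fix.

```sql
-- Hypothetical stand-ins for Customers and ##Metaphone; all values are invented.
DECLARE @Customers TABLE (custno int PRIMARY KEY, name varchar(50), [soundex] char(4));
DECLARE @Metaphone TABLE (custno int PRIMARY KEY, pkey varchar(10), akey varchar(10));

INSERT INTO @Customers (custno, name, [soundex])
SELECT 1, 'SMITH, JOHN',      'S100' UNION ALL
SELECT 2, 'SMYTH, JON',       'S200' UNION ALL
SELECT 3, 'SCHMIDT, JOHN',    'S300' UNION ALL
SELECT 4, 'PARENT, GUARDIAN', 'P100' UNION ALL
SELECT 5, 'PARENT, GUARDIAN', 'P100';

INSERT INTO @Metaphone (custno, pkey, akey)
SELECT 1, 'K1', 'A1' UNION ALL
SELECT 2, 'K1', 'A2' UNION ALL
SELECT 3, 'K2', 'A1' UNION ALL
SELECT 4, 'K3', 'A3' UNION ALL
SELECT 5, 'K3', 'A3';

-- Define the customer/metaphone join once so both branches can reuse it.
WITH custKeys AS
(
    SELECT c.custno, c.name, c.[soundex], m.pkey, m.akey
    FROM @Customers c
    JOIN @Metaphone m ON m.custno = c.custno
)
-- QUERY A: pairs with the same primary phonetic key
SELECT c1.custno AS c1Custno, c1.name AS c1Name, c2.custno AS c2Custno, c2.name AS c2Name
FROM custKeys c1
JOIN custKeys c2 ON c1.pkey = c2.pkey AND c1.custno < c2.custno
WHERE c1.name <> 'PARENT, GUARDIAN' AND c1.[soundex] <> c2.[soundex]

UNION

-- aKey twin of QUERY A: only the join condition changes
SELECT c1.custno, c1.name, c2.custno, c2.name
FROM custKeys c1
JOIN custKeys c2 ON c1.akey = c2.akey AND c1.custno < c2.custno
WHERE c1.name <> 'PARENT, GUARDIAN' AND c1.[soundex] <> c2.[soundex]

ORDER BY c1Custno, c2Custno;
```

With the invented rows above, the pKey branch pairs customers 1 and 2, the aKey branch pairs customers 1 and 3, and the 'PARENT, GUARDIAN' rows are filtered out of both.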

Here is the code:

    -- Create aggregate ranking
    SELECT c1Name, c2Name, nDiff, c1Addr, c2Addr, aDiff, c1SSN, c2SSN, sDiff, c1DOB, c2DOB, dDiff,
           nDiff + aDiff + dDiff + sDiff AS Score,
           (sDiff + dDiff) * 1.5 + (nDiff + dDiff) * 1.5 + (nDiff + sDiff) * 1.5 + aDiff * .5 + nDiff * .5 AS [Rank]
    FROM
    (
        -- Create match scores for different fields
        SELECT c1Name, c2Name, c1Addr, c2Addr, c1SSN, c2SSN, c1LTD, c2LTD, c1DOB, c2DOB,
               dbo.Jaro(c1Name, c2Name) AS nDiff,
               dbo.JaroWinkler(c1Addr, c2Addr) AS aDiff,
               CASE WHEN c1DOB = '1901-01-01' OR c2DOB = '1901-01-01' OR
                         c1DOB = '1800-01-01' OR c2DOB = '1800-01-01'
                    THEN .5 ELSE dbo.SmithWaterman(c1DOB, c2DOB) END AS dDiff,
               CASE WHEN c1SSN = '000-00-0000' OR c2SSN = '000-00-0000'
                    THEN .5 ELSE dbo.Jaro(c1SSN, c2SSN) END AS sDiff
        FROM
        -- Generate list of possible matches based on multiple phonetic matching fields
        (
            SELECT * FROM
            -- List of similar names from pKey field of ##Metaphone table
            -- QUERY A BEGIN
            (SELECT Customers.custno AS c1Custno, name AS c1Name, haddr AS c1Addr, ssn AS c1SSN,
                    lasttripdate AS c1LTD, dob AS c1DOB, soundex AS c1Soundex, pkey AS c1Pkey, akey AS c1Akey
             FROM Customers WITH (NOLOCK)
             JOIN ##Metaphone ON Customers.custno = ##Metaphone.custno) AS c1
            JOIN
            (SELECT Customers.custno AS c2Custno, name AS c2Name, haddr AS c2Addr, ssn AS c2SSN,
                    lasttripdate AS c2LTD, dob AS c2DOB, soundex AS c2Soundex, pkey AS c2Pkey, akey AS c2Akey
             FROM Customers WITH (NOLOCK)
             JOIN ##Metaphone ON Customers.custno = ##Metaphone.custno) AS c2
                ON (c1Pkey = c2Pkey) AND (c1Custno < c2Custno)
            WHERE (c1Name <> 'PARENT, GUARDIAN') AND c1Soundex != c2Soundex
            -- QUERY A END

            UNION
            -- List of similar names from pregenerated SOUNDEX field
            (SELECT t1.custno, t1.name, t1.haddr, t1.ssn, t1.lasttripdate, t1.dob, t1.[soundex], 0, 0,
                    t2.custno, t2.name, t2.haddr, t2.ssn, t2.lasttripdate, t2.dob, t2.[soundex], 0, 0
             FROM Customers t1 WITH (NOLOCK)
             JOIN Customers t2 WITH (NOLOCK) ON t1.[soundex] = t2.[soundex] AND t1.custno < t2.custno
             WHERE (t1.name <> 'PARENT, GUARDIAN'))
        ) AS a
    ) AS b
    WHERE (sDiff + dDiff) * 1.5 + (nDiff + dDiff) * 1.5 + (nDiff + sDiff) * 1.5 + aDiff * .5 + nDiff * .5 >= 7.5
    ORDER BY [Rank] DESC, Score DESC
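
The Rank expression is written out twice in the query, once in the select list and once in the WHERE clause. Collecting terms, the weights reduce to 3.5 * nDiff + 3 * sDiff + 3 * dDiff + .5 * aDiff, so the maximum is 10 if every score lies between 0 and 1 (true of Jaro and Jaro-Winkler similarity by definition, but only an assumption for this dbo.SmithWaterman implementation). The sketch below computes Rank once in a derived table and filters on the alias in an outer query. The table variable `@Scored`, its `pairId` column, the alias `ranked`, and the sample scores are invented stand-ins for the output of the scoring subquery.

```sql
-- Invented sample scores standing in for the rows produced by subquery "b".
DECLARE @Scored TABLE (pairId int PRIMARY KEY,
                       nDiff decimal(4,3), aDiff decimal(4,3),
                       dDiff decimal(4,3), sDiff decimal(4,3));

INSERT INTO @Scored (pairId, nDiff, aDiff, dDiff, sDiff)
SELECT 1, 1.0, 1.0, 1.0, 1.0 UNION ALL   -- Rank works out to 10
SELECT 2, 0.9, 0.8, 0.5, 0.5 UNION ALL   -- Rank works out to 6.55 (below the cutoff)
SELECT 3, 0.9, 0.6, 1.0, 0.5;            -- Rank works out to 7.95

SELECT pairId, Score, [Rank]
FROM
(
    -- Compute both aggregates once; the outer query can then refer to the aliases.
    SELECT pairId,
           nDiff + aDiff + dDiff + sDiff AS Score,
           3.5 * nDiff + 3 * sDiff + 3 * dDiff + .5 * aDiff AS [Rank]
    FROM @Scored
) AS ranked
WHERE [Rank] >= 7.5
ORDER BY [Rank] DESC, Score DESC;
```

With these sample rows, pairs 1 and 3 pass the 7.5 cutoff and pair 2 is filtered out.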

Previously, I was using joins such as on c1.pkey = c2.pkey or c1.akey = c2.akey or c1.soundex = c2.soundex, but the performance was horrendous, and using unions seems to work a lot better. Out of 103K customers, it currently generates a list of 8.5M potential matches (based on the phonetic codes) in 2.25 minutes, and then takes another 2 minutes to score, rank, and filter those down to about 3000. So I am happy with the performance; I just can't help but think there is a better way to structure this, and I need help adding the extra union condition.
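
To make the two shapes concrete, here is a small sketch of the OR join next to the UNION form, using an invented table variable `@Keys` with placeholder key values (not real metaphone or soundex output). Both statements return the same set of pairs. A plausible reason for the speed difference, offered as an assumption rather than something measured here, is that each single-equality join can be planned as a hash or merge join, while an OR across three different columns often cannot.

```sql
-- Invented phonetic keys for four hypothetical customers.
DECLARE @Keys TABLE (custno int PRIMARY KEY, pkey varchar(10), akey varchar(10), sx char(4));

INSERT INTO @Keys (custno, pkey, akey, sx)
SELECT 1, 'K1', 'A1', 'S100' UNION ALL
SELECT 2, 'K1', 'A2', 'S100' UNION ALL
SELECT 3, 'K2', 'A1', 'S200' UNION ALL
SELECT 4, 'K3', 'A3', 'S300';

-- Form 1: one join with OR across the three keys.
SELECT k1.custno AS c1Custno, k2.custno AS c2Custno
FROM @Keys k1
JOIN @Keys k2
  ON k1.custno < k2.custno
 AND (k1.pkey = k2.pkey OR k1.akey = k2.akey OR k1.sx = k2.sx)
ORDER BY c1Custno, c2Custno;

-- Form 2: one equality join per key; UNION removes pairs found by more than one key.
SELECT k1.custno AS c1Custno, k2.custno AS c2Custno
FROM @Keys k1 JOIN @Keys k2 ON k1.pkey = k2.pkey AND k1.custno < k2.custno
UNION
SELECT k1.custno, k2.custno
FROM @Keys k1 JOIN @Keys k2 ON k1.akey = k2.akey AND k1.custno < k2.custno
UNION
SELECT k1.custno, k2.custno
FROM @Keys k1 JOIN @Keys k2 ON k1.sx = k2.sx AND k1.custno < k2.custno
ORDER BY c1Custno, c2Custno;
```

Because only the custno pair is selected, a pair that matches on several keys (customers 1 and 2 here, via both pkey and sx) appears once in the UNION result.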

Thanks!
