In SQL Server what is most efficient way to compare records to other records for duplicates with in
Posted
by Glenn
on Stack Overflow
See other posts from Stack Overflow
or by Glenn
Published on 2010-03-24T15:51:24Z
Indexed on
2010/03/24
15:53 UTC
Read the original article
Hit count: 236
We have an SQL Server that gets daily imports of data files from clients. This data is interrelated and we are always scrubbing it and having to look for suspect duplicate records between these files.
Finding and tagging suspect records can get pretty complicated. We use logic that requires some field values to be the same, allows some field values to differ, and allows a range to be specified for how different certain field values can be. The only way we've found to do it is by using a cursor based process, and it places a heavy burden on the database.
So I wanted to ask if there's a more efficient way to do this. I've heard it said that there's almost always a more efficient way to replace cursors with clever JOINS. But I have to admit I'm having a lot of trouble with this one.
For a concrete example suppose we have 1 table, an "orders" table, with the following 6 fields.
order_id, customer_id product_id, quantity, sale_date, price
We want to look through the records to find suspect duplicates on the following example criteria. These get increasingly harder. 1. Records that have the same product_id, sale_date, and quantity but different customer_id's should be marked as suspect duplicates for review. 2. Records that have the same customer_id, product_id, quantity and have sale_dates within five days of each other should be marked as suspect duplicates for review 3. Records that have the same customer_id, product_id, but different quantities within 20 units, and sales dates within five days of each other should be considered suspect.
Is it possible to satisfy each one of these criteria with a single SQL Query that uses JOINS? Is this the most efficient way to do this?
© Stack Overflow or respective owner