In SQL Server, what is the most efficient way to compare records to other records for duplicates within the same table?
- by Glenn
We have a SQL Server instance that receives daily imports of data files from clients. The data is interrelated, and we are constantly scrubbing it and looking for suspect duplicate records across these files.
Finding and tagging suspect records can get pretty complicated. We use logic that requires some field values to match exactly, allows some field values to differ freely, and allows a tolerance to be specified for how much certain field values can differ. The only way we've found to do it is with a cursor-based process, and it places a heavy burden on the database.
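Stripped down, our current process looks something like the sketch below. (This uses the example "orders" table described later; the real matching logic is more involved, and the suspect column is just an assumed flag for marking records for review.)

    -- Simplified sketch of the cursor-based pass, using the first
    -- criterion below as the example comparison.
    DECLARE @order_id INT, @customer_id INT, @product_id INT,
            @quantity INT, @sale_date DATE;

    DECLARE order_cursor CURSOR FOR
        SELECT order_id, customer_id, product_id, quantity, sale_date
        FROM orders;

    OPEN order_cursor;
    FETCH NEXT FROM order_cursor
        INTO @order_id, @customer_id, @product_id, @quantity, @sale_date;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Compare the current row against every other row and
        -- flag matches as suspect.
        UPDATE o
        SET o.suspect = 1
        FROM orders AS o
        WHERE o.order_id    <> @order_id
          AND o.product_id   = @product_id
          AND o.sale_date    = @sale_date
          AND o.quantity     = @quantity
          AND o.customer_id <> @customer_id;

        FETCH NEXT FROM order_cursor
            INTO @order_id, @customer_id, @product_id, @quantity, @sale_date;
    END

    CLOSE order_cursor;
    DEALLOCATE order_cursor;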
So I wanted to ask if there's a more efficient way to do this. I've heard it said that cursors can almost always be replaced with clever JOINs, but I have to admit I'm having a lot of trouble with this one.
For a concrete example, suppose we have one table, an "orders" table, with the following six fields:
order_id,
customer_id,
product_id,
quantity,
sale_date,
price
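For concreteness, a minimal version of the table might look like this (the types are simplified, and the suspect column is just an assumption for where we'd mark records for review):

    -- Minimal version of the example table; types are simplified.
    CREATE TABLE orders (
        order_id    INT IDENTITY(1,1) PRIMARY KEY,
        customer_id INT            NOT NULL,
        product_id  INT            NOT NULL,
        quantity    INT            NOT NULL,
        sale_date   DATE           NOT NULL,
        price       DECIMAL(10, 2) NOT NULL,
        suspect     BIT            NOT NULL DEFAULT 0  -- flag for review
    );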
We want to look through the records and flag suspect duplicates based on the following example criteria, which get progressively harder (my rough attempts at each are sketched after the list).
1. Records that have the same product_id, sale_date, and quantity but different customer_ids should be marked as suspect duplicates for review.
2. Records that have the same customer_id, product_id, and quantity, and have sale_dates within five days of each other, should be marked as suspect duplicates for review.
3. Records that have the same customer_id and product_id, quantities that differ by no more than 20 units, and sale_dates within five days of each other, should be considered suspect.
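For what it's worth, here are my rough attempts at expressing the three criteria as self-joins. I'm not sure this is the right direction, or how it will perform on our volumes:

    -- Criterion 1: same product_id, sale_date, and quantity,
    -- but a different customer_id.
    SELECT DISTINCT a.order_id
    FROM orders AS a
    JOIN orders AS b
      ON  b.product_id   = a.product_id
      AND b.sale_date    = a.sale_date
      AND b.quantity     = a.quantity
      AND b.customer_id <> a.customer_id;

    -- Criterion 2: same customer_id, product_id, and quantity;
    -- sale_dates within five days of each other.
    SELECT DISTINCT a.order_id
    FROM orders AS a
    JOIN orders AS b
      ON  b.customer_id = a.customer_id
      AND b.product_id  = a.product_id
      AND b.quantity    = a.quantity
      AND b.order_id   <> a.order_id
      AND ABS(DATEDIFF(DAY, a.sale_date, b.sale_date)) <= 5;

    -- Criterion 3: same customer_id and product_id; quantities
    -- differing by at most 20 units; sale_dates within five days.
    SELECT DISTINCT a.order_id
    FROM orders AS a
    JOIN orders AS b
      ON  b.customer_id = a.customer_id
      AND b.product_id  = a.product_id
      AND b.order_id   <> a.order_id
      AND ABS(b.quantity - a.quantity) <= 20
      AND ABS(DATEDIFF(DAY, a.sale_date, b.sale_date)) <= 5;

I assume each SELECT could be turned into an UPDATE that sets the suspect flag, but I don't know whether several passes like this actually beat the cursor.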
Is it possible to satisfy each of these criteria with a single SQL query that uses JOINs? And is that the most efficient way to do this?