De-dupe a list of hundreds of thousands of first name/last name/address/date of birth records
Posted by Darren on Stack Overflow
Published on 2011-01-13T01:50:27Z
Indexed on 2011/01/13 1:53 UTC
mysql
I have a large data set that I know contains many duplicate records. Each record has a first name, a last name, several address components, and a date of birth.
I think the best approach is to match on name and date of birth, since chances are that if these match, it's the same person. There are probably many cases with slight differences in spelling (such as a typo that drops a single letter) or in how the name is entered (e.g., some records might have a middle initial in the first name column). It would be good to account for these, but I'm not sure how to approach it.
Are there any tools or articles on going about this process? The data is all in a MySQL database and I have a basic proficiency in SQL.
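To make the matching idea concrete, here is a minimal sketch in Python rather than SQL. It assumes the rows have already been exported from MySQL as `(id, first, last, dob)` tuples. The function names, the tuple layout, and the 0.85 similarity threshold are all hypothetical choices for illustration and are not part of the original data set. It groups records by exact date of birth, strips a trailing middle initial from the first name, and then flags pairs whose names are nearly identical according to the standard library's `difflib.SequenceMatcher`.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize_first(name):
    """Lowercase, drop periods, and drop a trailing one-letter middle initial."""
    parts = name.lower().replace(".", " ").split()
    if len(parts) > 1 and len(parts[-1]) == 1:
        parts = parts[:-1]
    return " ".join(parts)


def similar(a, b, threshold=0.85):
    """True when two strings are close enough to be a likely typo of each other."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def candidate_duplicates(rows, threshold=0.85):
    """rows: iterable of (id, first, last, dob). Returns a list of (id_a, id_b) pairs."""
    # Group by exact date of birth so only records sharing a DOB are compared.
    by_dob = defaultdict(list)
    for row_id, first, last, dob in rows:
        by_dob[dob].append((row_id, normalize_first(first), last.strip().lower()))

    pairs = []
    for group in by_dob.values():
        for (id_a, f_a, l_a), (id_b, f_b, l_b) in combinations(group, 2):
            # Both first and last name must be near matches to flag the pair.
            if similar(f_a, f_b, threshold) and similar(l_a, l_b, threshold):
                pairs.append((id_a, id_b))
    return pairs
```

Grouping by date of birth first keeps the number of pairwise comparisons manageable for hundreds of thousands of rows, at the cost of missing duplicates whose date of birth itself contains a typo. The flagged pairs are candidates for review, not confirmed duplicates, and the threshold would need tuning against real data.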
© Stack Overflow or respective owner