Create a unique ID by fuzzy matching of names (via agrep using R)

Posted by tbrambor on Stack Overflow
Published on 2012-10-21T16:31:00Z Indexed on 2012/10/21 17:00 UTC

Using R, I am trying to match people's names in a dataset structured by year and city. Because of spelling mistakes, exact matching is not possible, so I am trying to use agrep() to fuzzy-match the names.

A sample chunk of the dataset is structured as follows:

df <- data.frame(matrix(
  c("1200013", "1200013", "1200013", "1200013",
    "1200013", "1200013", "1200013", "1200013",
    "1996", "1996", "1996", "1996", "2000", "2000", "2004", "2004",
    "AGUSTINHO FORTUNATO FILHO", "ANTONIO PEREIRA NETO",
    "FERNANDO JOSE DA COSTA", "PAULO CEZAR FERREIRA DE ARAUJO",
    "PAULO CESAR FERREIRA DE ARAUJO", "SEBASTIAO BOCALOM RODRIGUES",
    "JOAO DE ALMEIDA", "PAULO CESAR FERREIRA DE ARAUJO"),
  ncol = 3,
  dimnames = list(1:8, c("citycode", "year", "candidate"))
))

The neat version:

  citycode year                      candidate
1  1200013 1996      AGUSTINHO FORTUNATO FILHO
2  1200013 1996           ANTONIO PEREIRA NETO
3  1200013 1996         FERNANDO JOSE DA COSTA
4  1200013 1996 PAULO CEZAR FERREIRA DE ARAUJO
5  1200013 2000 PAULO CESAR FERREIRA DE ARAUJO
6  1200013 2000    SEBASTIAO BOCALOM RODRIGUES
7  1200013 2004                JOAO DE ALMEIDA
8  1200013 2004 PAULO CESAR FERREIRA DE ARAUJO

I'd like to check, separately for each city, whether any candidates appear in several years. For example, in the sample above,

PAULO CEZAR FERREIRA DE ARAUJO

PAULO CESAR FERREIRA DE ARAUJO

are the same candidate, listed under two spellings (CEZAR and CESAR). Each distinct candidate across the entire dataset should be assigned a unique numeric candidate ID. The dataset is fairly large (5,500 cities, approx. 100K entries), so a reasonably efficient approach would be helpful. Any suggestions on how to implement this?
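
To make the intended result concrete, here is a minimal sketch of one possible approach, not a definitive solution. It swaps agrep() for adist() (pairwise edit distances, in utils) and links names within a city by single-linkage clustering. The helper cluster_names and the threshold max_edits = 2 are assumptions made for illustration and would need tuning on the real data.

```r
# Sample rows from the question, with strings kept as character vectors.
df <- data.frame(
  citycode  = rep("1200013", 8),
  year      = c("1996", "1996", "1996", "1996", "2000", "2000", "2004", "2004"),
  candidate = c("AGUSTINHO FORTUNATO FILHO", "ANTONIO PEREIRA NETO",
                "FERNANDO JOSE DA COSTA", "PAULO CEZAR FERREIRA DE ARAUJO",
                "PAULO CESAR FERREIRA DE ARAUJO", "SEBASTIAO BOCALOM RODRIGUES",
                "JOAO DE ALMEIDA", "PAULO CESAR FERREIRA DE ARAUJO"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label near-identical names within one city.
# max_edits (an assumed threshold) is the largest edit distance at which
# two spellings are still linked into the same group.
cluster_names <- function(nms, max_edits = 2) {
  u <- unique(nms)
  if (length(u) == 1L) return(setNames(1L, u))
  d <- as.dist(adist(u))  # pairwise edit distances between unique names
  setNames(cutree(hclust(d, method = "single"), h = max_edits), u)
}

# Build a "citycode_group" key for every row, one city at a time.
key <- character(nrow(df))
for (idx in split(seq_len(nrow(df)), df$citycode)) {
  grp <- cluster_names(df$candidate[idx])
  key[idx] <- paste(df$citycode[idx], grp[df$candidate[idx]], sep = "_")
}

# Number the keys in order of first appearance to get a numeric ID.
df$candidate_id <- match(key, unique(key))
```

On the sample rows, the CEZAR and CESAR spellings (rows 4, 5 and 8) end up with the same candidate_id, and every other name gets its own. Because distances are only computed among the unique names of a single city, the quadratic cost of adist() stays small per city. Single linkage can chain distinct names together if the threshold is set too loosely, so the grouping should be spot-checked.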

© Stack Overflow or respective owner
