Create a unique ID by fuzzy matching of names (via agrep using R)

Posted by tbrambor on Stack Overflow
Published on 2012-10-21T16:31:00Z Indexed on 2012/10/21 17:00 UTC

Filed under: r, string-matching, fuzzy, agrep

Using R, I am trying to match people's names in a dataset structured by year and city. Because of some spelling mistakes, exact matching is not possible, so I am trying to use agrep() to fuzzy-match the names.
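For reference, here is a minimal sketch of what agrep() returns on a character vector. The names are taken from the sample data; max.distance = 2 is an assumed tolerance chosen for illustration, not a recommended value.

```r
# Two spellings of the same name, plus an unrelated name
candidates <- c("PAULO CEZAR FERREIRA DE ARAUJO",
                "PAULO CESAR FERREIRA DE ARAUJO",
                "JOAO DE ALMEIDA")

# agrep() returns the indices of the elements that approximately
# contain the pattern; max.distance = 2 allows up to two edits
# (assumed tolerance, tune it for real data)
hits <- agrep("PAULO CESAR FERREIRA DE ARAUJO", candidates,
              max.distance = 2)

print(candidates[hits])
```

Note that agrep() looks for an approximate match of the pattern anywhere inside each string, so a short name can match inside a longer one. When whole names have to be compared, adist() gives the edit distance between complete strings.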

A sample chunk of the dataset is structured as follows:

df <- data.frame(matrix(
  c("1200013", "1200013", "1200013", "1200013",
    "1200013", "1200013", "1200013", "1200013",
    "1996", "1996", "1996", "1996", "2000", "2000", "2004", "2004",
    "AGUSTINHO FORTUNATO FILHO",
    "ANTONIO PEREIRA NETO",
    "FERNANDO JOSE DA COSTA",
    "PAULO CEZAR FERREIRA DE ARAUJO",
    "PAULO CESAR FERREIRA DE ARAUJO",
    "SEBASTIAO BOCALOM RODRIGUES",
    "JOAO DE ALMEIDA",
    "PAULO CESAR FERREIRA DE ARAUJO"),
  ncol = 3,
  dimnames = list(1:8, c("citycode", "year", "candidate"))
))

The neat version:

  citycode year                      candidate
1  1200013 1996      AGUSTINHO FORTUNATO FILHO
2  1200013 1996           ANTONIO PEREIRA NETO
3  1200013 1996         FERNANDO JOSE DA COSTA
4  1200013 1996 PAULO CEZAR FERREIRA DE ARAUJO
5  1200013 2000 PAULO CESAR FERREIRA DE ARAUJO
6  1200013 2000    SEBASTIAO BOCALOM RODRIGUES
7  1200013 2004                JOAO DE ALMEIDA
8  1200013 2004 PAULO CESAR FERREIRA DE ARAUJO

I'd like to check, in each city separately, whether any candidates appear in several years. In the example,

PAULO CEZAR FERREIRA DE ARAUJO

PAULO CESAR FERREIRA DE ARAUJO

are the same candidate, who appears in 1996, 2000 and 2004 under two different spellings. Each candidate across the entire dataset should be assigned a unique numeric candidate ID. The dataset is fairly large (5,500 cities, approx. 100K entries), so reasonably efficient code would be helpful. Any suggestions as to how to implement this?
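One possible way to set this up, sketched under assumptions rather than offered as a definitive solution: compute whole-string edit distances with adist() among the unique names of each city, group names that lie within a small distance of each other using single-linkage clustering, and then number the (city, group) pairs across the whole dataset. The helper group_names() and the threshold max_edits = 2 are hypothetical choices made for this illustration. Note that this swaps agrep() for adist(), because adist() compares complete strings.

```r
# Sample data from the question, as character columns
df <- data.frame(
  citycode  = "1200013",
  year      = c("1996", "1996", "1996", "1996",
                "2000", "2000", "2004", "2004"),
  candidate = c("AGUSTINHO FORTUNATO FILHO", "ANTONIO PEREIRA NETO",
                "FERNANDO JOSE DA COSTA", "PAULO CEZAR FERREIRA DE ARAUJO",
                "PAULO CESAR FERREIRA DE ARAUJO", "SEBASTIAO BOCALOM RODRIGUES",
                "JOAO DE ALMEIDA", "PAULO CESAR FERREIRA DE ARAUJO"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label the names of ONE city so that names linked
# by a chain of edit distances <= max_edits share a label
group_names <- function(nms, max_edits = 2) {
  u <- unique(nms)
  if (length(u) == 1L) return(rep(1L, length(nms)))
  d <- as.dist(adist(u))                    # pairwise Levenshtein distances
  g <- cutree(hclust(d, method = "single"), h = max_edits)
  unname(g[match(nms, u)])                  # map labels back to every row
}

# Group within each city separately
local_id <- ave(seq_len(nrow(df)), df$citycode,
                FUN = function(i) group_names(df$candidate[i]))

# Turn each (city, local group) pair into one dataset-wide integer ID,
# numbered in order of first appearance
key <- paste(df$citycode, local_id)
df$candidate_id <- as.integer(factor(key, levels = unique(key)))

print(df)
```

On the sample, rows 4, 5 and 8 end up with the same candidate_id. Since the distance matrix is built per city, and 100K entries over 5,500 cities is roughly 18 entries per city on average, the quadratic cost of adist() stays small. Single linkage can chain near-duplicates together (A close to B, B close to C), so the threshold deserves a check against real data.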
