Complex string matching with fuzzywuzzy
Posted by That1Guy on Programmers
Published on 2012-10-16T19:18:36Z
I'm attempting to write a process that matches obscure strings to a single 'master string' for further processing. I have a lot of data that looks something like this:
Basketball
Basket Ball
Football
BasketBallR
BBall
BBall - r
FootB
...and so on. These need to be mapped to a master record like so:
Basketball = Basket Ball, BBall
Basketball - R = BasketBallR, BBall - r
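A minimal sketch of this mapping, assuming the master list is known up front. It uses the standard library's difflib in place of fuzzywuzzy so that it runs without third-party packages; the names MASTERS, normalize and match, and the 0.6 cutoff, are illustrative assumptions rather than anything from my actual data:

```python
import difflib
import re

# Assumption: the master records are known in advance.
MASTERS = ["Basketball", "Basketball - R", "Football"]

def normalize(s):
    # Lowercase and drop everything except letters and digits,
    # so 'Basket Ball' and 'BasketBall' compare equal.
    return re.sub(r"[^a-z0-9]", "", s.lower())

# Normalized key -> original master record.
KEYS = {normalize(m): m for m in MASTERS}

def match(raw, cutoff=0.6):
    # Return the closest master record, or None if nothing clears the cutoff.
    hits = difflib.get_close_matches(normalize(raw), list(KEYS), n=1, cutoff=cutoff)
    return KEYS[hits[0]] if hits else None

for raw in ["Basket Ball", "BBall", "BasketBallR", "BBall - r", "FootB"]:
    print(raw, "->", match(raw))
```

Normalizing before comparing removes spacing and casing differences, which leaves the similarity score to deal only with real abbreviations such as BBall. The cutoff would need tuning against real data.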
I also have instances of data resembling this format:
Football -r
FootBall - r-g/H,Q,HH
These compound entries need to be split into separate categories before being mapped. For example, FootBall - r-g/H,Q,HH should become:
Football - r
Football - g
Football - H
Football - Q
Football - HH
At this point, it still needs to be mapped to a master record...
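The splitting step on its own can be sketched as follows. This assumes the first hyphen separates the base name from its suffixes and that suffixes are delimited by '-', '/' or ','; the function name split_variants is hypothetical, and the capitalize() call is a naive way to turn FootBall into Football:

```python
import re

def split_variants(raw):
    """Split 'Name - a-b/C,D' into ['Name - a', 'Name - b', ...]."""
    # Assumption: the first hyphen separates the base name from the suffixes.
    base, sep, rest = raw.partition("-")
    base = base.strip().capitalize()  # naive casing fix: 'FootBall' -> 'Football'
    if not sep:
        return [base]
    # Assumption: suffixes are delimited by '-', '/' or ','.
    suffixes = [s.strip() for s in re.split(r"[-/,]", rest) if s.strip()]
    return [f"{base} - {s}" for s in suffixes] or [base]

print(split_variants("FootBall - r-g/H,Q,HH"))
print(split_variants("Football -r"))
```

Each resulting string would then go through the same matching step as every other entry.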
I've tried several different combinations of fuzzywuzzy matching methods, Levenshtein Distance measurements, regex, etc. and can't seem to find a reliable method to logically associate different naming styles of a single item with a master name.
I'm throwing my hands up in desperation. Are there any existing Python resources that can help sort out my problem? Are there other options? Can anybody point out an obvious option that I might have overlooked?
Basically, any suggestion, solution, resource or alternative method is greatly appreciated.