Clustering Strings on the basis of Common Substrings
Posted by pk188 on Programmers
Published on 2013-06-26T21:20:35Z
I have more than 10,000 strings and need to identify and group all the strings that look similar. I base the similarity on the number of common words between any two given strings: the more words two strings have in common, the more similar they are. For instance:
- How to make another layer from an existing layer
- Unable to edit data on the network drive
- Existing layers in the desktop
- Assistance with network drive
In this case, strings 1 and 3 are similar (common words: "existing" and "layer"), and strings 2 and 4 are similar (common words: "network" and "drive"), once stop words are eliminated.
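The similarity measure described above could be sketched in C# roughly as follows. This is only an illustration under stated assumptions: the class name `WordOverlap`, the short stop-word list, and the rule that strips a trailing "s" (so that "layers" matches "layer") are all hypothetical choices, not part of the original problem statement. A real project would likely use a fuller stop-word list and a proper stemmer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordOverlap
{
    // Hypothetical stop-word list, for illustration only.
    private static readonly HashSet<string> StopWords = new HashSet<string>
    {
        "a", "an", "the", "to", "from", "in", "on", "with", "how", "of", "and"
    };

    private static readonly char[] Separators =
        { ' ', '\t', ',', '.', ';', ':', '(', ')' };

    // Lower-cases the text, splits it into words, drops stop words,
    // and returns the set of distinct normalized words.
    public static HashSet<string> Tokenize(string text)
    {
        var words = text.ToLowerInvariant()
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => !StopWords.Contains(w))
            .Select(Normalize);
        return new HashSet<string>(words);
    }

    // Assumption: strip one trailing "s" as a crude stand-in for stemming.
    private static string Normalize(string word)
    {
        bool looksPlural = word.Length > 3
            && word.EndsWith("s", StringComparison.Ordinal)
            && !word.EndsWith("ss", StringComparison.Ordinal);
        return looksPlural ? word.Substring(0, word.Length - 1) : word;
    }

    // Number of distinct non-stop words that both strings share.
    public static int CommonWordCount(string a, string b)
    {
        var common = Tokenize(a);
        common.IntersectWith(Tokenize(b));
        return common.Count;
    }
}
```

With these assumptions, `CommonWordCount` returns 2 for strings 1 and 3, 2 for strings 2 and 4, and 0 for strings 1 and 2 from the example list.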
The steps I'm following are:
- Iterate through the data set
- Compare the strings row by row
- Find the common words between each pair of strings
- Form a cluster when the number of common words (excluding stop words) is greater than or equal to 2
- If the number of common words is less than 2, put the string in a new cluster
- Assign each row either to an existing cluster or to a new one, depending on the common words
- Continue until all the strings are processed
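One possible reading of the steps above is a single greedy pass, sketched here in C#. The details are assumptions rather than a settled design: a string joins the first existing cluster in which any member shares at least `minCommonWords` words with it, and otherwise starts a new cluster. The class name `GreedyClusterer`, the stop-word list, and the trailing-"s" normalization are hypothetical and for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GreedyClusterer
{
    // Hypothetical stop-word list, for illustration only.
    private static readonly HashSet<string> StopWords = new HashSet<string>
    {
        "a", "an", "the", "to", "from", "in", "on", "with", "how", "of", "and"
    };

    private static readonly char[] Separators =
        { ' ', '\t', ',', '.', ';', ':', '(', ')' };

    // Lower-case, split into words, drop stop words, and strip a
    // trailing "s" (an assumed, crude substitute for stemming).
    private static HashSet<string> Tokenize(string text)
    {
        return new HashSet<string>(
            text.ToLowerInvariant()
                .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
                .Where(w => !StopWords.Contains(w))
                .Select(w => w.Length > 3
                        && w.EndsWith("s", StringComparison.Ordinal)
                        && !w.EndsWith("ss", StringComparison.Ordinal)
                    ? w.Substring(0, w.Length - 1)
                    : w));
    }

    public static List<List<string>> Cluster(
        IEnumerable<string> inputs, int minCommonWords = 2)
    {
        var clusters = new List<List<string>>();
        // Token sets of each cluster's members, kept in parallel with clusters.
        var clusterTokens = new List<List<HashSet<string>>>();

        foreach (var input in inputs)
        {
            var tokens = Tokenize(input);

            // Find the first cluster with a sufficiently similar member.
            int target = -1;
            for (int i = 0; i < clusters.Count && target < 0; i++)
            {
                bool matches = clusterTokens[i].Any(member =>
                    member.Count(w => tokens.Contains(w)) >= minCommonWords);
                if (matches) target = i;
            }

            // No match: start a new cluster for this string.
            if (target < 0)
            {
                clusters.Add(new List<string>());
                clusterTokens.Add(new List<HashSet<string>>());
                target = clusters.Count - 1;
            }

            clusters[target].Add(input);
            clusterTokens[target].Add(tokens);
        }
        return clusters;
    }
}
```

On the four example strings this produces two clusters, one holding strings 1 and 3 and the other holding strings 2 and 4. Note that the result of a greedy pass like this depends on the input order, and in the worst case it compares every string with every earlier one, so it is quadratic in the number of strings.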
I am implementing the project in C# and have got as far as step 3. However, I'm not sure how to proceed with the clustering. I have researched string clustering extensively but could not find a solution that fits my problem. Any input would be highly appreciated.