Clustering Strings on the basis of Common Substrings
- by pk188
I have 10,000+ strings and need to identify and group all the strings that look similar (I base similarity on the number of common words between any two given strings). The more words two strings have in common, the more similar they are. For instance:
1. How to make another layer from an existing layer
2. Unable to edit data on the network drive
3. Existing layers in the desktop
4. Assistance with network drive
In this case, strings 1 and 3 are similar because they share the words "existing" and "layer", and strings 2 and 4 are similar because they share the words "network" and "drive" (after eliminating stop words).
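To make the similarity measure concrete, here is a minimal sketch of the common-word comparison described above. It is written in Python purely for brevity; the same logic maps onto a C# `HashSet<string>`. The stop-word list, the crude plural stripping (needed so that "layers" matches "layer"), and the function names `keywords` and `common_words` are all assumptions made for illustration, not part of the original problem statement.

```python
import re

# Assumed stop-word list, for illustration only; a real list would be longer.
STOP_WORDS = {"a", "an", "the", "to", "on", "in", "from", "with", "how", "of", "and"}

def keywords(text):
    """Lowercase the text, split it into words, drop stop words,
    and strip a trailing plural 's' in a deliberately naive way."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word in STOP_WORDS:
            continue
        # Naive singularization (assumption): "layers" -> "layer".
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        result.add(word)
    return result

def common_words(a, b):
    """Return the set of non-stop words that both strings contain."""
    return keywords(a) & keywords(b)

s1 = "How to make another layer from an existing layer"
s2 = "Unable to edit data on the network drive"
s3 = "Existing layers in the desktop"
s4 = "Assistance with network drive"

print(common_words(s1, s3))  # the words shared by strings 1 and 3
print(common_words(s2, s4))  # the words shared by strings 2 and 4
```

Without some normalization of case and plurals, strings 1 and 3 would share only one exact word, so how words are normalized directly affects which strings count as similar.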
The steps I'm following are:
1. Iterate through the data set.
2. Do a row-by-row comparison.
3. Find the common words between the strings.
4. Form a cluster where the number of common words is greater than or equal to 2 (after eliminating stop words).
5. If the number of common words is less than 2, put the string in a new cluster.
6. Assign each row either to an existing cluster or to a new one, depending on the common words.
7. Continue until all the strings are processed.
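The steps above can be sketched as a single greedy pass. This is only one possible reading, shown in Python for brevity rather than C#: it assumes a string joins the first existing cluster in which any member shares at least two non-stop words with it, and otherwise starts a new cluster. The stop-word list, the naive plural stripping, and the names `keywords`, `cluster_strings`, and `min_common` are hypothetical choices made for this illustration.

```python
import re

# Assumed stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "to", "on", "in", "from", "with", "how", "of", "and"}

def keywords(text):
    """Lowercased non-stop words, with a naive plural 's' stripped (assumption)."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word in STOP_WORDS:
            continue
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        result.add(word)
    return result

def cluster_strings(strings, min_common=2):
    """Greedy one-pass clustering: each string joins the first cluster that has
    a member sharing at least `min_common` keywords, else it starts a new one."""
    clusters = []  # each cluster is a list of (original string, keyword set)
    for s in strings:
        kw = keywords(s)
        for members in clusters:
            if any(len(kw & other_kw) >= min_common for _, other_kw in members):
                members.append((s, kw))
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([(s, kw)])
    return [[s for s, _ in members] for members in clusters]

data = [
    "How to make another layer from an existing layer",
    "Unable to edit data on the network drive",
    "Existing layers in the desktop",
    "Assistance with network drive",
]

for group in cluster_strings(data):
    print(group)
```

Because a string is assigned to the first matching cluster, the result depends on input order, and comparing against every member is quadratic in the worst case. Comparing against a per-cluster keyword set, or building an index from word to cluster, are possible refinements under the same assumptions.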
I am implementing the project in C# and have got as far as step 3. However, I'm not sure how to proceed with the clustering. I have researched string clustering extensively but could not find a solution that fits my problem. Your input would be highly appreciated.