How do I cluster strings based on a relation between two strings?
- by Tom Wijsman
If you don't know WEKA, you can try a theoretical answer. I don't need literal code/examples...
I have a huge data set of strings that I want to cluster in order to find the most closely related ones; these could just as well be seen as duplicates. I already have a set of pairs of strings that I know are duplicates of each other, so now I want to do some data mining on those two sets.
The result I'm looking for is a system that returns the most relevant pairs of strings that are not yet known to be duplicates. I believe I need clustering for this, but which type?
Note that I want to base myself on word occurrence comparison, not on interpretation or meaning.
Here is an example of two strings that we know are duplicates (from our point of view):
The weather is really cold and it is raining.
It is raining and the weather is really cold.
Now, the following strings also exist (most to least relevant, ignoring stop words):
Is the weather really that cold today?
Rainy days are awful.
I see the sunshine outside.
The software would return the following two strings as the most relevant pair not yet known to be duplicates:
The weather is really cold and it is raining.
Is the weather really that cold today?
I would then mark that pair as duplicate or not duplicate, and it would present me with another pair.
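To make the desired behaviour concrete, here is a minimal brute-force sketch in Python (not WEKA). It is only an illustration under assumptions of my own: the tiny stop-word list, the choice of Jaccard similarity on word sets as the "word occurrence comparison", and the names `word_set`, `jaccard` and `rank_candidates` are all hypothetical, not part of any existing system.

```python
import re
from itertools import combinations

# Assumed, deliberately tiny stop-word list; a real one would be much larger.
STOP_WORDS = {"the", "is", "and", "it", "that", "are", "i"}

def word_set(text):
    """Lowercase the text and return its set of non-stop-word tokens."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS}

def jaccard(a, b):
    """Shared words divided by all distinct words (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_candidates(strings, already_judged):
    """Score every pair that has not been judged yet, highest score first."""
    judged = {frozenset(pair) for pair in already_judged}
    sets = {s: word_set(s) for s in strings}
    scored = [
        (jaccard(sets[a], sets[b]), a, b)
        for a, b in combinations(strings, 2)
        if frozenset((a, b)) not in judged
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)

strings = [
    "The weather is really cold and it is raining.",
    "It is raining and the weather is really cold.",
    "Is the weather really that cold today?",
    "Rainy days are awful.",
    "I see the sunshine outside.",
]
# The one pair already known to be duplicates.
judged = [(strings[0], strings[1])]

ranked = rank_candidates(strings, judged)
score, first, second = ranked[0]
print(first)
print(second)
```

On the example data this puts the pair from the example at the top. Once a pair has been marked either way, it would be appended to `judged` and the ranking rerun to get the next pair. Two limitations of the sketch are worth noting: without stemming, "Rainy" and "raining" count as different words, and every pair of strings is compared, which is exactly what will not scale to a large data set.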
How do I go about implementing this efficiently, in a way that scales to a large data set?