How do I cluster strings based on a relation between two strings?

Posted by Tom Wijsman on Programmers
Published on 2011-08-15T15:07:06Z

Filed under:

data-mining

If you don't know WEKA, you can try a theoretical answer. I don't need literal code/examples...

I have a huge data set of strings that I want to cluster in order to find the most related ones; these could just as well be seen as duplicates. I already have a set of pairs of strings that I know to be duplicates of each other, so now I want to do some data mining on those two sets.

The result I'm looking for is a system that returns the most relevant pairs of strings that are not yet known to be duplicates. I believe I need clustering for this, but which type?

Note that I want to base this on word occurrence comparison, not on interpretation or meaning.

Here is an example of two strings that we know to be duplicates (as we see them):

The weather is really cold and it is raining.
It is raining and the weather is really cold.

Now, the following strings also exist (ordered from most to least relevant, ignoring stop words):

Is the weather really that cold today?
Rainy days are awful.
I see the sunshine outside.
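To make "word occurrence comparison" concrete, here is a minimal sketch that scores two strings by the overlap of their word sets (Jaccard similarity). The stop-word list and the function names `content_words` and `jaccard` are assumptions made for illustration; they do not come from WEKA or from the question itself.

```python
import re

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"the", "is", "it", "and", "that", "are", "i", "a", "an"}

def content_words(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS}

def jaccard(a, b):
    """Fraction of distinct content words that the two strings share."""
    wa, wb = content_words(a), content_words(b)
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

known_a = "The weather is really cold and it is raining."
known_b = "It is raining and the weather is really cold."

print(jaccard(known_a, known_b))                                   # same word set
print(jaccard(known_a, "Is the weather really that cold today?"))  # partial overlap
print(jaccard(known_a, "Rainy days are awful."))                   # no shared word
```

With exact word matching, "Rainy" and "raining" count as different words, so the last two example strings would score the same against the known duplicates. Ranking "Rainy days are awful." higher would probably need stemming or a similar normalisation step, which this sketch leaves out.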

The software would return the following two strings as the most relevant pair that is not yet known to be a duplicate:

The weather is really cold and it is raining.
Is the weather really that cold today?

Then I would mark that pair as duplicate or not duplicate, and it would present me with another pair.
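The review loop described here could look roughly like the sketch below. It is only meant to pin down the desired behaviour: the names `similarity` and `next_candidate`, the stop-word list, and the use of Jaccard overlap are all assumptions made for illustration.

```python
import re
from itertools import combinations

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"the", "is", "it", "and", "that", "are", "i", "a", "an"}

def content_words(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS}

def similarity(a, b):
    """Jaccard overlap of the two strings' content words."""
    wa, wb = content_words(a), content_words(b)
    return len(wa & wb) / len(wa | wb) if (wa or wb) else 0.0

def next_candidate(strings, labelled):
    """Return the highest-scoring pair that has not been labelled yet, or None."""
    unknown = [
        (a, b) for a, b in combinations(strings, 2)
        if frozenset((a, b)) not in labelled
    ]
    if not unknown:
        return None
    return max(unknown, key=lambda pair: similarity(*pair))

strings = [
    "The weather is really cold and it is raining.",
    "It is raining and the weather is really cold.",
    "Is the weather really that cold today?",
    "Rainy days are awful.",
    "I see the sunshine outside.",
]

# Pairs already marked as duplicate or not duplicate are stored as frozensets.
labelled = {frozenset((strings[0], strings[1]))}

best = next_candidate(strings, labelled)
print(best)

# After the reviewer labels the pair, it is never proposed again.
labelled.add(frozenset(best))
print(next_candidate(strings, labelled))
```

This version scores every pair on each call, so its cost grows quadratically with the number of strings. It shows the intended behaviour but is not suitable as-is for a large data set.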

How do I implement this in the most efficient way, so that it can be applied to a large data set?
