Clustering Strings on the basis of Common Substrings

Posted by pk188 on Programmers See other posts from Programmers or by pk188
Published on 2013-06-26T21:20:35Z Indexed on 2013/06/26 22:29 UTC
Read the original article Hit count: 280

Filed under:

c#

|

sql

|

analysis

|

data-mining

|

cluster

I have around 10000+ strings and have to identify and group all the strings which looks similar(I base the similarity on the number of common words between any two give strings). The more number of common words, more similar the strings would be. For instance:

How to make another layer from an existing layer
Unable to edit data on the network drive
Existing layers in the desktop
Assistance with network drive

In this case, the strings 1 and 3 are similar with common words Existing, Layer and 2 and 4 are similar with common words Network Drive(eliminating stop word)

The steps I'm following are:

Iterate through the data set
Do a row by row comparison
Find the common words between the strings
Form a cluster where number of common words is greater than or equal to 2(eliminating stop words)
If number of common words<2, put the string in a new cluster.
Assign the rows either to the existing clusters or form a new one depending upon the common words
Continue until all the strings are processed

I am implementing the project in C#, and have got till step 3. However, I'm not sure how to proceed with the clustering. I have researched a lot about string clustering but could not find any solution that fits my problem. Your inputs would be highly appreciated.

© Programmers or respective owner

Related posts about c#

.NET WebRequest.PreAuthenticate not quite what it sounds like

as seen on West-Wind - Search for 'West-Wind'
I’ve run into the problem a few times now: How to pre-authenticate .NET WebRequest calls doing an HTTP call to the server – essentially send authentication credentials on the very first request instead of waiting for a server challenge first? At first glance this sound like it should be easy:… >>> More
HttpWebRequest and Ignoring SSL Certificate Errors

as seen on West-Wind - Search for 'West-Wind'
Man I can't believe this. I'm still mucking around with OFX servers and it drives me absolutely crazy how some these servers are just so unbelievably misconfigured. I've recently hit three different 3 major brokerages which fail HTTP validation with bad or corrupt certificates at least according to… >>> More
The dynamic Type in C# Simplifies COM Member Access from Visual FoxPro

as seen on West-Wind - Search for 'West-Wind'
I’ve written quite a bit about Visual FoxPro interoperating with .NET in the past both for ASP.NET interacting with Visual FoxPro COM objects as well as Visual FoxPro calling into .NET code via COM Interop. COM Interop with Visual FoxPro has a number of problems but one of them at least got a lot… >>> More
Dynamic Type to do away with Reflection

as seen on West-Wind - Search for 'West-Wind'
The dynamic type in C# 4.0 is a welcome addition to the language. One thing I’ve been doing a lot with it is to remove explicit Reflection code that’s often necessary when you ‘dynamically’ need to walk and object hierarchy. In the past I’ve had a number of ReflectionUtils that used string based expressions… >>> More
Finding a Relative Path in .NET

as seen on West-Wind - Search for 'West-Wind'
Here’s a nice and simple path utility that I’ve needed in a number of applications: I need to find a relative path based on a base path. So if I’m working in a folder called c:\temp\templates\ and I want to find a relative path for c:\temp\templates\subdir\test.txt I want to receive back subdir\test… >>> More

Related posts about sql

SQL SERVER – Concat Strings in SQL Server using T-SQL – SQL in Sixty Seconds #035 – Video

as seen on SQL Authority - Search for 'SQL Authority'
Concatenating string is one of the most common tasks in SQL Server and every developer has to come across it. We have to concat the string when we have to see the display full name of the person by first name and last name. In this video we will see various methods to concatenate the strings. SQL… >>> More
SQL SERVER – Concat Function in SQL Server – SQL Concatenation

as seen on SQL Authority - Search for 'SQL Authority'
Earlier this week, I was delivering Advanced BI training on the subject of “SQL Server 2008 R2″. I had great time delivering the session. During the session, we talked about SQL Server 2010 Denali. Suddenly one of the attendees suggested his displeasure for the product. He said, even though… >>> More
Error with SQL Server Setup 2012 on Windows 2012

as seen on Server Fault - Search for 'Server Fault'
I am trying to install SQL Server on Windows 2012. I was able to finally get the wizard up and running after making some changes on the server, but now it fails no matter what I do with the following error: TITLE: SQL Server Setup failure. SQL Server Setup has encountered the following error: … >>> More
How can I detect which version of SQL (eg SQL 2008 or SQL Azure)

as seen on Stack Overflow - Search for 'Stack Overflow'
I need to detect which version of SQL I am dealing with to perorm various tasks, I need specifically detect if I am on SQL 2008 or SQL Azure. How can I do this with detection code written in SQL? >>> More
Nested SQL Select statement fails on SQL Server 2000, ok on SQL Server 2005

as seen on Stack Overflow - Search for 'Stack Overflow'
Here is the query: INSERT INTO @TempTable SELECT UserID, Name, Address1 = (SELECT TOP 1 [Address] FROM (SELECT TOP 1 [Address] FROM [UserAddress] ua INNER JOIN UserAddressOrder uo ON ua.UserID = uo.UserID WHERE ua.UserID = u.UserID ORDER BY uo.AddressOrder ASC) q ORDER BY AddressOrder… >>> More