How do you efficiently implement a document similarity search system?

Posted by Björn Lindqvist on Stack Overflow
Published on 2010-02-03T10:36:32Z Indexed on 2010/03/19 1:51 UTC

Filed under: cosine | similarity | recommendation | search | engine

How do you implement a "similar items" system for items described by a set of tags?

In my database, I have three tables: Article, ArticleTag and Tag. Each Article is related to a number of Tags via a many-to-many relationship. For each Article I want to find the five most similar articles to implement an "if you like this article you will like these too" system.

I am familiar with cosine similarity, and using that algorithm works very well. But it is way too slow. For each article, I need to iterate over all articles, calculate the cosine similarity for the article pair and then select the five articles with the highest similarity rating.
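
For reference, the brute-force approach described above can be sketched roughly as follows. This is a minimal illustration and not the original code: it assumes tags are unweighted, so each article is a binary tag vector and the cosine similarity reduces to |A ∩ B| / sqrt(|A| * |B|). The names `article_tags`, `cosine_similarity` and `most_similar_brute_force` are hypothetical, and the in-memory dict stands in for the Article/ArticleTag tables.

```python
import heapq
import math


def cosine_similarity(tags_a, tags_b):
    """Cosine similarity of two tag sets treated as binary vectors."""
    if not tags_a or not tags_b:
        return 0.0
    return len(tags_a & tags_b) / math.sqrt(len(tags_a) * len(tags_b))


def most_similar_brute_force(article_id, article_tags, k=5):
    """Score every other article against the target and keep the top k.

    article_tags is a hypothetical dict mapping article id -> set of tags.
    The cost is one similarity computation per article in the corpus.
    """
    target = article_tags[article_id]
    scored = (
        (cosine_similarity(target, tags), other_id)
        for other_id, tags in article_tags.items()
        if other_id != article_id
    )
    # nlargest avoids sorting the whole corpus just to keep k results.
    return [other_id for _score, other_id in heapq.nlargest(k, scored)]


if __name__ == "__main__":
    article_tags = {
        1: {"python", "search"},
        2: {"python", "search"},
        3: {"python"},
        4: {"cooking"},
    }
    print(most_similar_brute_force(1, article_tags, k=2))
```

Every query touches every article, which is why the running time grows linearly with the size of the corpus.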

With 200k articles and 30k tags, it takes me half a minute to calculate the similar articles for a single article. So I need another algorithm that produces results roughly as good as cosine similarity, but that can run in real time and does not require me to iterate over the whole document corpus each time.
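
One common way to avoid scanning the whole corpus is an inverted index from tag to articles, so that only articles sharing at least one tag with the target are ever scored. The sketch below is an assumption about how that might look, not a solution taken from the original post. All names (`build_tag_index`, `most_similar_indexed`, `article_tags`) are hypothetical, and tags are again assumed to be unweighted. Under that assumption, an article with no tag in common has cosine similarity 0, so skipping it does not change the ranking of the articles that do overlap.

```python
import math
from collections import Counter, defaultdict


def build_tag_index(article_tags):
    """Build an inverted index: tag -> set of article ids carrying that tag."""
    index = defaultdict(set)
    for article_id, tags in article_tags.items():
        for tag in tags:
            index[tag].add(article_id)
    return index


def most_similar_indexed(article_id, article_tags, tag_index, k=5):
    """Return up to k most similar articles, scoring only tag-sharing candidates."""
    target = article_tags[article_id]

    # Count shared tags per candidate by walking the posting list of each tag.
    overlap = Counter()
    for tag in target:
        for other_id in tag_index.get(tag, ()):
            if other_id != article_id:
                overlap[other_id] += 1

    # Binary-vector cosine: shared / sqrt(|A| * |B|).
    scored = [
        (shared / math.sqrt(len(target) * len(article_tags[other_id])), other_id)
        for other_id, shared in overlap.items()
    ]
    scored.sort(reverse=True)
    return [other_id for _score, other_id in scored[:k]]


if __name__ == "__main__":
    article_tags = {
        1: {"python", "search"},
        2: {"python", "search"},
        3: {"python"},
        4: {"cooking"},
    }
    tag_index = build_tag_index(article_tags)
    print(most_similar_indexed(1, article_tags, tag_index, k=2))
```

The work per query is proportional to the combined length of the posting lists for the target's tags rather than to the corpus size. Very popular tags still produce long posting lists, so this helps most when tags are reasonably selective.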

Maybe someone can suggest an off-the-shelf solution for this? Most of the search engines I looked at do not support document similarity searching.

© Stack Overflow or respective owner

Related posts about cosine

Find cosine similarity in R

as seen on Stack Overflow - Search for 'Stack Overflow'
I'm wondering if there is a built-in function in R that can find the cosine similarity (or cosine distance) between two arrays? Currently, I have implemented my own function, but I can't help but think that R should already come with one :) Thanks, Derek
Java cosine similarity problem

as seen on Stack Overflow - Search for 'Stack Overflow'
Hi again :) I developed a Java program to calculate cosine similarity on the basis of TF*IDF. It worked very well. But there is one problem. For example, if I have the following two matrices and I want to calculate cosine similarity, it does not work because the rows are not the same length: doc 1 1 2 3 4…
Simple implementation of N-Gram, tf-idf and Cosine similarity in Python

as seen on Stack Overflow - Search for 'Stack Overflow'
I need to compare documents stored in a DB and come up with a similarity score between 0 and 1. The method I need to use has to be very simple. Implementing a vanilla version of n-grams (where it is possible to define how many grams to use), along with a simple implementation of tf-idf and Cosine similarity…
Binary Cosine Coefficient

as seen on Stack Overflow - Search for 'Stack Overflow'
I was given the following formula for calculating this: sim = |Q ∩ D| / (√|Q| · √|D|). I went ahead and implemented a class to compare strings consisting of a series of words #pragma once #include <vector> #include <string> #include <iostream> #include <vector> using namespace…

Related posts about similarity

fast similarity detection

as seen on Stack Overflow - Search for 'Stack Overflow'
I have a large collection of objects and I need to figure out the similarities between them. To be exact: given two objects, I can compute their dissimilarity as a number, a metric, where higher values mean less similarity and 0 means the objects have identical contents. The cost of computing this number…
Lucene numDocs and doqFreq on custom similarity class

as seen on Stack Overflow - Search for 'Stack Overflow'
Hi all, I'm building an application with Lucene (I'm a noob with it) and I'm facing some problems. My application uses the Lucene 2.4.0 library with a custom similarity implementation (the jar is imported). In my app I'm calculating doqFreq and numDocs manually (I'm adding the values of all indexes and then…
Document Similarity: Comparing two documents efficiently

as seen on Stack Overflow - Search for 'Stack Overflow'
I have a loop that calculates the similarity between two documents. It collects all the tokens in a document and their scores, and places them in a dictionary. It then compares the dictionaries. This is what I have so far; it works, but is super slow: # Doc A cursor1.execute("SELECT token, tfidf_norm…
Advice on String Similarity Metrics (Java). Distance, sounds like or combo?

as seen on Stack Overflow - Search for 'Stack Overflow'
Hello, part of a process requires applying string similarity algorithms. The results of this process will be stored and produce, let's say, SS_Dataset. Based on this dataset, further decisions will have to be made. My questions are: Should I apply one or more string similarity algorithms…