How do you efficiently implement a document similarity search system?
- by Björn Lindqvist
How do you implement a "similar items" system for items described by a
set of tags?
In my database, I have three tables, Article, ArticleTag and Tag. Each
Article is related to a number of Tags via a many-to-many
relationship. For each Article I want to find the five most similar
articles to implement an "if you like this article you will like these
too" system.
I am familiar with cosine similarity,
and using that algorithm works very well. But it is way too slow. For
each article, I need to iterate over all articles, calculate the
cosine similarity for the article pair and then select the five
articles with the highest similarity rating.
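
To make the brute-force approach concrete, here is a minimal sketch of
what is described above. It assumes each article is held in memory as a
plain set of tags (so every tag has weight 1), which may not match the
actual weighting in use; the names cosine_similarity, most_similar and
article_tags are made up for illustration.

```python
import heapq
import math


def cosine_similarity(tags_a, tags_b):
    """Cosine similarity of two tag sets treated as binary vectors."""
    if not tags_a or not tags_b:
        return 0.0
    # For 0/1 vectors the dot product is the number of shared tags and
    # each vector norm is the square root of its tag count.
    return len(tags_a & tags_b) / math.sqrt(len(tags_a) * len(tags_b))


def most_similar(article_id, article_tags, k=5):
    """Return the k highest (similarity, other_id) pairs for one article.

    This scans every other article, which is the O(N) work per query
    that the question wants to avoid.
    """
    target = article_tags[article_id]
    scored = (
        (cosine_similarity(target, tags), other_id)
        for other_id, tags in article_tags.items()
        if other_id != article_id
    )
    return heapq.nlargest(k, scored)


# Hypothetical toy data: article id -> set of tag names.
article_tags = {
    1: {"python", "database", "search"},
    2: {"python", "database"},
    3: {"cooking"},
    4: {"search", "python", "database"},
}

top = most_similar(1, article_tags, k=2)
```

Every call to most_similar touches all other articles, so the cost of a
single query grows linearly with the size of the corpus.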
With 200k articles and 30k tags, it takes me half a minute to
calculate the similar articles for a single article. So I need
another algorithm that produces results roughly as good as cosine
similarity but that can be run in real time and does not require
me to iterate over the whole document corpus each time.
Maybe someone can suggest an off-the-shelf solution for this? Most of
the search engines I looked at do not support document similarity
searching.