Algorithm for finding similar users through a join table

Posted by Gdeglin on Stack Overflow See other posts from Stack Overflow or by Gdeglin
Published on 2010-05-20T22:50:36Z Indexed on 2010/05/20 23:20 UTC
Read the original article Hit count: 285

Filed under:
|

I have an application where users can select a variety of interests from around 300 possible interests. Each selected interest is stored in a join table containing the columns user_id and interest_id.

Typical users select around 50 interests out of the 300.

I would like to build a system where users can find the top 20 users that have the most interests in common with them.

Right now I am able to accomplish this using the following query:

SELECT i2.user_id, count(i2.interest_id) AS count 
  FROM interests_users as i1, interests_users as i2
    WHERE i1.interest_id = i2.interest_id AND i1.user_id = 35
  GROUP BY i2.user_id
  ORDER BY count DESC LIMIT 20;

However, this query takes approximately 500 milliseconds to execute with 10,000 users and 500,000 rows in the join table. All indexes and database configuration settings have been tuned to the best of my ability.

I have also tried avoiding the use of joins altogether using the following query:

select user_id,count(interest_id) count
  from interests_users
    where interest_id in (13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,508)
  group by user_id 
  order by count desc 
  limit 20;

But this one is even slower (~800 milliseconds).

How could I best lower the time that I can gather this kind of data to below 100 milliseconds?

I have considered putting this data into a graph database like Neo4j, but I am not sure if that is the easiest solution or if it would even be faster than what I am currently doing.

© Stack Overflow or respective owner

Related posts about sql

Related posts about algorithm