I have an application where users can select a variety of interests from around 300 possible interests. Each selected interest is stored in a join table containing the columns user_id and interest_id.
Typical users select around 50 interests out of the 300.
I would like to build a system where users can find the top 20 users that have the most interests in common with them.
Right now I am able to accomplish this using the following query:
SELECT i2.user_id, count(i2.interest_id) AS count
FROM interests_users as i1, interests_users as i2
WHERE i1.interest_id = i2.interest_id AND i1.user_id = 35
GROUP BY i2.user_id
ORDER BY count DESC LIMIT 20;
However, this query takes approximately 500 milliseconds to execute with 10,000 users and 500,000 rows in the join table. All indexes and database configuration settings have been tuned to the best of my ability.
I have also tried avoiding the use of joins altogether using the following query:
select user_id,count(interest_id) count
from interests_users
where interest_id in (13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,508)
group by user_id
order by count desc
limit 20;
But this one is even slower (~800 milliseconds).
How could I best lower the time that I can gather this kind of data to below 100 milliseconds?
I have considered putting this data into a graph database like Neo4j, but I am not sure if that is the easiest solution or if it would even be faster than what I am currently doing.