Speeding up inner-joins and subqueries while restricting row size and table membership
Posted by hiffy on Stack Overflow on 2010-03-13.
I'm developing an RSS feed reader that uses a Bayesian filter to filter out boring blog posts.
The Stream table is meant to act as a FIFO buffer from which the webapp will consume 'entries'. I use it to store the temporary relationship between entries, users, and Bayesian filter classifications.
After a user marks an entry as read, it is added to the metadata table (so that a user isn't presented with material they have already read) and deleted from the Stream table. Every three minutes, a background process repopulates the Stream table with new entries (i.e. whenever the daemon adds new entries after it checks the RSS feeds for updates).
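For reference, here's a minimal sketch of the relevant tables (column types are illustrative, and only the columns the query touches are shown):

    -- Minimal, illustrative schema; only the columns the query touches.
    create table entries (
        id              integer primary key,
        subscription_id integer not null
    );

    create table subscriptions_users (
        subscription_id integer not null,
        user_id         integer not null
    );

    create table metadata (
        entry_id integer not null,   -- entries the user has already read
        user_id  integer not null
    );

    create table stream (
        entry_id integer not null,   -- unread entries queued for display
        user_id  integer not null
    );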
Problem: The query I came up with is hella slow. More importantly, the Stream table only needs to hold one hundred unread entries at a time; capping it there would reduce duplication, make processing faster, and give me some flexibility in how I display the entries.
The query (takes about 9 seconds on 3600 items with no indexes):
    insert into stream (entry_id, user_id)
    select entries.id, subscriptions_users.user_id
    from entries
    inner join subscriptions_users
        on subscriptions_users.subscription_id = entries.subscription_id
    where subscriptions_users.user_id = 1
      and entries.id not in (select entry_id from metadata where metadata.user_id = 1)
      and entries.id not in (select entry_id from stream where user_id = 1);
The query explained: insert into stream all of the entries from a user's subscription list (subscriptions_users) that the user has not read (i.e. that do not exist in metadata) and that do not already exist in the stream.
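Since the timing above is with no indexes, the obvious first step is indexing the columns the join and the two NOT IN subqueries filter on; something like the following (index names are mine, assuming the sketch above):

    -- Illustrative indexes covering the join and both NOT IN subqueries.
    create index idx_su_user_sub   on subscriptions_users (user_id, subscription_id);
    create index idx_entries_sub   on entries (subscription_id);
    create index idx_metadata_user on metadata (user_id, entry_id);
    create index idx_stream_user   on stream (user_id, entry_id);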
Attempted solution: adding limit 100 to the end speeds up the query considerably, but on repeated executions it keeps adding a different set of 100 entries that do not already exist in the table (with each successive query taking longer and longer). This is close, but not quite what I wanted.
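For concreteness, here is the same logic with the limit applied, rewritten as LEFT JOIN anti-joins instead of NOT IN subqueries (an untested sketch; aliases are mine, and I don't know whether the optimizer treats it any better than NOT IN):

    -- Untested sketch: same logic as above, with NOT IN rewritten as
    -- LEFT JOIN anti-joins and the LIMIT applied.
    insert into stream (entry_id, user_id)
    select e.id, su.user_id
    from entries e
    inner join subscriptions_users su
        on su.subscription_id = e.subscription_id
    left join metadata m
        on m.entry_id = e.id and m.user_id = su.user_id
    left join stream s
        on s.entry_id = e.id and s.user_id = su.user_id
    where su.user_id = 1
      and m.entry_id is null   -- not already read
      and s.entry_id is null   -- not already queued
    limit 100;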
Does anyone have any advice (NoSQL?) or know a more efficient way of composing the query?