How to calculate real-time stats?
- by Diego Jancic
I have a site with millions of users (well, actually it doesn't have any yet, but let's imagine), and I want to calculate some stats like "log-ins in the past hour".
The problem is similar to the one described here: http://highscalability.com/blog/2008/4/19/how-to-build-a-real-time-analytics-system.html
The simplest approach would be to do a select like this:
select count(distinct user_id)
from logs
where date>='20120601 1200' and date <='20120601 1300'
(of course other conditions could apply for the stats, like log-ins per country)
Of course this would be really slow, mainly if it has millions (or even thousands) of rows, and I want to query this every time a page is displayed.
How would you summarize the data? What should go to the (mem)cache?
EDIT: I'm looking for a way to de-normalize the data, or to keep the cache up-to-date. For example I could increment an in-memory variable every time someone logs in, but that would help to know the total amount of logins, not the "logins in the last hour". Hope it's more clear now.