Developing an analytics system that processes large amounts of data - where to start?
- by Ryan
Imagine you're writing some sort of web analytics system - you're recording raw page hits along with some extra data, such as tagging cookies, and then producing stats such as:
Which pages got most traffic over a time period
Which referers sent most traffic
Goals completed (goal being a view of a particular page)
And more advanced things, like which referers sent the most visitors who later hit a goal.
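To make that last report concrete, here is a rough sketch of what I mean. It assumes each hit is a record of timestamp, visitor cookie, page and referer, and that a visitor is credited to the first referer seen for them; the `Hit` fields and the function name are made up for illustration, not taken from any real system.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    ts: int                 # seconds since epoch (assumed format)
    visitor: str            # value of the tagging cookie
    page: str
    referer: Optional[str]  # None for direct traffic


def referers_leading_to_goal(hits, goal_page):
    """Count, per referer, the visitors who viewed goal_page on or after arrival."""
    first_referer = {}   # visitor -> referer of their first referred hit
    converted = set()    # visitors already counted, so each counts once
    counts = Counter()
    for hit in sorted(hits, key=lambda h: h.ts):
        if hit.referer is not None and hit.visitor not in first_referer:
            first_referer[hit.visitor] = hit.referer
        if (hit.page == goal_page
                and hit.visitor in first_referer
                and hit.visitor not in converted):
            converted.add(hit.visitor)
            counts[first_referer[hit.visitor]] += 1
    return counts
```

This is fine for a handful of hits held in memory, but it walks every raw hit on every query, which is exactly the part I don't expect to hold up at volume.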
The naive way of approaching this would be to throw it all into a relational database and run queries over the raw hits - but that won't scale to large volumes of data.
You could pre-calculate everything (have a queue of incoming 'hits' and use it to update report tables) - but what if you later change a goal? How could you efficiently re-calculate just the data that would be affected?
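For what it's worth, this is roughly how I picture the pre-calculation approach. It is only a sketch under my own assumptions: raw hits are kept in per-day buckets so they can be replayed, report tables are keyed by day, and a goal is a single page. The `Reports` class and its method names are hypothetical.

```python
from collections import Counter, defaultdict

SECONDS_PER_DAY = 86400


class Reports:
    def __init__(self, goal_page):
        self.goal_page = goal_page
        self.raw_by_day = defaultdict(list)     # day -> raw hits, kept for replays
        self.page_views = defaultdict(Counter)  # day -> page -> hit count
        self.goal_hits = Counter()              # day -> goal completions

    def record(self, ts, page):
        """Consume one hit from the queue and update every report table."""
        day = ts // SECONDS_PER_DAY
        self.raw_by_day[day].append((ts, page))
        self.page_views[day][page] += 1
        if page == self.goal_page:
            self.goal_hits[day] += 1

    def change_goal(self, new_goal_page):
        """Rebuild only the goal table; page_views does not depend on the goal."""
        self.goal_page = new_goal_page
        self.goal_hits = Counter()
        for day, hits in self.raw_by_day.items():
            completions = sum(1 for _, page in hits if page == new_goal_page)
            if completions:
                self.goal_hits[day] = completions
```

Changing a goal here leaves the page table alone, but it still replays every raw hit to rebuild the goal table, and that replay is the cost I'd like to avoid or at least bound.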
Obviously this has been done before ;) so any tips on where to start would be welcome: methods and examples, architecture, technologies, etc.