Architecture for analysing search result impressions/clicks to improve future searches
- by Hais
We have a large database of items (10m+) stored in MySQL and intend to implement search over the metadata on these items, taking advantage of something like Sphinx. The dataset changes slightly each day, so Sphinx will re-index daily.
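To give an idea of the moving parts, here is a minimal sketch of the Sphinx side, assuming a sphinx.conf with one MySQL source and one index (host, credentials, column names and paths are placeholders, not our real setup); the daily re-index would just be `indexer --rotate items_idx` from cron:

```
# minimal sphinx.conf sketch -- credentials, columns and paths are placeholders
source items_src
{
    type      = mysql
    sql_host  = localhost
    sql_user  = sphinx
    sql_pass  = secret
    sql_db    = items_db
    # only the metadata columns that should be searchable
    sql_query = SELECT id, title, description, tags FROM items
}

index items_idx
{
    source = items_src
    path   = /var/lib/sphinx/items_idx
}
```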
However, we want the search to self-learn and improve its results by analysing impression and click data, so that we give our customers better results for that search term, and possibly for similar search terms too.
I've been reading up on Hadoop and it seems like it has the potential to crunch all this data, although I'm still unsure how to approach it.
Amazon has tutorials for compiling impression vs click data using MapReduce, but I can't see how to get this data into a usable format.
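For the sake of discussion, here is the sort of job I have in mind, sketched as a Hadoop Streaming step in Python. The log format is a guess (tab-separated `event`, `search_term`, `item_id` lines, with event being either impression or click); the output is one row per (search_term, item_id) with impression count, click count and click-through rate, which could then be bulk-loaded into a MySQL table or a key-value store each hour:

```python
#!/usr/bin/env python
# ctr_job.py -- rough sketch of a Hadoop Streaming job over a hypothetical log format:
#   input lines:  event<TAB>search_term<TAB>item_id   (event = impression|click)
#   output lines: search_term<TAB>item_id<TAB>impressions<TAB>clicks<TAB>ctr
# Run roughly as:
#   hadoop jar hadoop-streaming.jar \
#     -D stream.num.map.output.key.fields=2 \
#     -input /logs/search -output /analytics/ctr \
#     -mapper "python ctr_job.py map" -reducer "python ctr_job.py reduce" \
#     -file ctr_job.py
import sys

def do_map():
    # emit a composite (term, item_id) key so the shuffle groups events per pair
    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed lines
        event, term, item_id = parts
        imp = 1 if event == "impression" else 0
        clk = 1 if event == "click" else 0
        print("%s\t%s\t%d\t%d" % (term, item_id, imp, clk))

def do_reduce():
    # reducer input arrives grouped by key, so aggregate in a single pass
    cur_key, imps, clks = None, 0, 0
    def flush():
        if cur_key is not None:
            ctr = float(clks) / imps if imps else 0.0
            print("%s\t%d\t%d\t%.4f" % (cur_key, imps, clks, ctr))
    for line in sys.stdin:
        term, item_id, imp, clk = line.rstrip("\n").split("\t")
        key = term + "\t" + item_id
        if key != cur_key:
            flush()
            cur_key, imps, clks = key, 0, 0
        imps += int(imp)
        clks += int(clk)
    flush()

if __name__ == "__main__":
    do_reduce() if sys.argv[1:2] == ["reduce"] else do_map()
```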
My idea is that when a search term comes in, I query Sphinx to get all the matching items from the dataset, then query the analytics (compiled hourly or so) to find the most popular items for that search term, and finally cache the combined results using something like Memcached, Membase or similar.
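Roughly, the serving path I'm imagining looks like the sketch below. Everything in it is assumed rather than settled: `items_idx` is the Sphinx index from the config above, `search_ctr` is a MySQL table holding the hourly job's output, Sphinx is queried over its SphinxQL (MySQL-protocol) port 9306 via pymysql, and caching goes through python-memcached.

```python
# Rough sketch of the serving path -- index/table names, credentials and the
# cache TTL are assumptions, not a finished design.
import hashlib
import json

import memcache   # python-memcached
import pymysql    # Sphinx's searchd also speaks the MySQL protocol (SphinxQL)

mc = memcache.Client(["127.0.0.1:11211"])
sphinx = pymysql.connect(host="127.0.0.1", port=9306, user="")   # SphinxQL endpoint
analytics = pymysql.connect(host="127.0.0.1", user="app",
                            password="secret", db="analytics")   # hourly CTR output lives here

def search(term, limit=20):
    # memcached keys can't contain spaces, so hash the normalised term
    cache_key = "search:" + hashlib.md5(term.lower().encode("utf-8")).hexdigest()
    cached = mc.get(cache_key)
    if cached:
        return json.loads(cached)

    # 1. full-text match against the Sphinx index
    with sphinx.cursor() as cur:
        cur.execute("SELECT id FROM items_idx WHERE MATCH(%s) LIMIT 200", (term,))
        ids = [row[0] for row in cur.fetchall()]
    if not ids:
        return []

    # 2. look up per-item CTR for this term from the hourly-compiled analytics
    with analytics.cursor() as cur:
        cur.execute("SELECT item_id, ctr FROM search_ctr WHERE term = %s", (term,))
        ctr = dict(cur.fetchall())

    # 3. re-rank: items that get clicked for this term float to the top;
    #    sorted() is stable, so unseen items keep Sphinx's original order
    ranked = sorted(ids, key=lambda i: ctr.get(i, 0.0), reverse=True)[:limit]

    # 4. cache the final list until roughly the next analytics run
    mc.set(cache_key, json.dumps(ranked), time=3600)
    return ranked
```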
Am I along the right lines here?