Sharding / indexing strategy for multi-faceted search
- by Graham
I'm currently thinking about our database structure and how to adapt it for scale. Specifically, we're considering Elasticsearch to provide our search functionality.
One common pattern with Elasticsearch seems to be the 'user-routing' pattern; that is, using routing to ensure that all of one user's data resides on the same shard. This is great for client-specific search, e.g. Gmail.
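To make the routing idea concrete, here is a minimal sketch of how a routing value maps to a shard. Elasticsearch picks the shard roughly as hash(routing) % number_of_primary_shards; the CRC32 hash below is only a stand-in for the hash Elasticsearch actually uses, and the shard count, user IDs and document IDs are hypothetical.

```python
import zlib

NUM_SHARDS = 5  # hypothetical number of primary shards


def shard_for(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing value to a shard number.

    Stand-in for Elasticsearch's routing formula; CRC32 is used here
    only because it is deterministic and in the standard library.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


# Hypothetical (user_id, doc_id) pairs, routed by user ID.
docs = [("alice", "doc-1"), ("alice", "doc-2"), ("bob", "doc-3")]

placement = {}
for user_id, doc_id in docs:
    placement.setdefault(shard_for(user_id, NUM_SHARDS), []).append(doc_id)

# Every document with the same routing value lands on the same shard.
print(placement)
```

Note that a shard still holds many users' documents, so routing narrows where a query runs but does not replace a filter on the user ID.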
Our application has a constraint such that any user will have at most a few thousand documents, so this pattern seems like a good candidate. However, our search needs to work across all users as well as target a specific user (so I might search my content, Alice's content, or all content). Similarly, we need to provide full-text search across any timeframe, from recent months to several years back.
I'm thinking of combining the 'user-routing' and 'index-per-time-interval' patterns:
I create an index for each month
By default, searches go through an alias that covers the most recent X months
If no results are found, we can search against the previous X months
As we grow, we can reduce the interval X
Each document is routed by the user ID
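The monthly-index part of the plan above can be sketched as follows. This only builds index names, the fallback search windows, and a request body in the shape accepted by the Elasticsearch `_aliases` endpoint; it does not talk to a cluster. The `docs-YYYY.MM` naming scheme, the `docs-recent` alias name and the helper names are assumptions made for illustration.

```python
from datetime import date
from typing import Dict, List


def month_index(d: date, prefix: str = "docs") -> str:
    """Index name for the month containing d, e.g. docs-2024.01 (hypothetical scheme)."""
    return f"{prefix}-{d.year:04d}.{d.month:02d}"


def previous_months(today: date, count: int, skip: int = 0) -> List[date]:
    """First-of-month dates for `count` months, newest first, starting `skip` months back."""
    total = today.year * 12 + (today.month - 1) - skip
    months = []
    for i in range(count):
        year, month_zero_based = divmod(total - i, 12)
        months.append(date(year, month_zero_based + 1, 1))
    return months


def search_windows(today: date, x: int, max_windows: int) -> List[List[str]]:
    """Index names to try in order: the most recent X months, then the X before, and so on."""
    return [
        [month_index(d) for d in previous_months(today, x, skip=w * x)]
        for w in range(max_windows)
    ]


def alias_actions(today: date, x: int, alias: str = "docs-recent") -> Dict:
    """Request body that points `alias` at the most recent X monthly indices."""
    return {
        "actions": [
            {"add": {"index": month_index(d), "alias": alias}}
            for d in previous_months(today, x)
        ]
    }


print(search_windows(date(2024, 1, 15), x=2, max_windows=2))
print(alias_actions(date(2024, 1, 15), x=2))
```

A real rollover would also need to remove the oldest index from the alias each month; that step is left out here.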
So, this should let us do the following:
Search by user. This will search all indices, hitting 1 shard in each
Search by time. This will search ~2 indices (by default) across all shards
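As a rough sketch of the two search shapes above, the helper below builds the path, query-string parameters and body for a search request without sending it. The `routing` search parameter and comma-separated index targets are standard Elasticsearch features; the `docs-*` pattern, the `content` and `user_id` field names, and the `build_search` helper itself are hypothetical.

```python
from typing import Dict, List, Optional


def build_search(text: str,
                 user_id: Optional[str] = None,
                 indices: Optional[List[str]] = None) -> Dict:
    """Describe a search request as plain data (nothing is sent).

    - user_id set: add routing so only that user's shard is queried,
      plus a term filter, since the shard also holds other users' documents.
    - indices set: restrict the search to those monthly indices;
      otherwise fall back to the hypothetical docs-* pattern.
    """
    target = ",".join(indices) if indices else "docs-*"
    query = {"bool": {"must": [{"match": {"content": text}}]}}
    params = {}
    if user_id is not None:
        params["routing"] = user_id
        query["bool"]["filter"] = [{"term": {"user_id": user_id}}]
    return {"path": f"/{target}/_search", "params": params, "body": {"query": query}}


# Search by user: all monthly indices, one shard in each.
print(build_search("quarterly report", user_id="alice"))

# Search by time: a couple of monthly indices, all shards.
print(build_search("quarterly report", indices=["docs-2024.01", "docs-2023.12"]))
```

The two options combine naturally: passing both `user_id` and `indices` would search one shard in each of the selected months.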
Is this a reasonable approach, considering we may scale to many millions of documents? Or should I be denormalizing the data somehow, so that user searches are performed on a totally separate index from date searches?
Thanks for any pros and cons of the above scenario.