I'm relatively new to StackExchange and not sure whether this is an appropriate place to ask a design question. The site gives me a hint: "The question you're asking appears subjective and is likely to be closed". Please let me know if it is off-topic.
Anyway, one of the projects I'm working on is an online survey engine. It's my first big commercial project on Google App Engine.
I need your advice on how to collect stats and record them in Datastore efficiently, without bankrupting myself. The initial requirements are:
After a user finishes a survey, the client sends a list of pairs [ID (int) + PercentHit (double)]. This list shows how closely this user's answers match the predefined answers of reference answerers (which are identified by IDs). I call them "target IDs".
The creator of the survey wants to see the aggregated % for given IDs for the last hour, for a particular time frame, or since the beginning of the survey.
Some surveys may have thousands of target/reference answerers.
So I created this entity:
public class HitsStatsDO implements Serializable
{
    @Id
    transient private Long id;
    transient private Long version = 0L;
    transient private Long startDate; // start of the period (hour) this record covers
    @Parent transient private Key parent; // fake parent which contains target id
    @Transient int targetId;
    private double avgPercent; // average PercentHit over hitCount hits
    private long hitCount;
}
But writing a HitsStatsDO for each target from each user would produce a lot of data. For instance, I had a survey with 3000 targets which was answered by ~4 million people within one week, with 300K people taking the survey on the first day. Even if we assume they answered evenly over 24 hours, that gives ~10,400 writes/second (300,000 × 3,000 / 86,400). Obviously that hits the concurrent write limit of Datastore.
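As a sanity check on that estimate, the arithmetic can be written out as a small calculation. This is only a back-of-the-envelope sketch: WriteRateEstimate and writesPerSecond are made-up names, and it assumes exactly one Datastore write per target per respondent, spread evenly over the day.

```java
// Hypothetical helper for a rough estimate; not part of the real project.
final class WriteRateEstimate {
    // Assumes one Datastore write per target for every respondent,
    // spread evenly over the given number of seconds.
    static double writesPerSecond(long respondents, long targets, long seconds) {
        return (double) respondents * targets / seconds;
    }

    public static void main(String[] args) {
        long secondsPerDay = 24L * 60 * 60; // 86,400 seconds
        // 300K respondents on the first day, 3000 targets each
        double rate = writesPerSecond(300000, 3000, secondsPerDay);
        System.out.println(Math.round(rate) + " writes/second");
    }
}
```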
So I decided to collect data for one hour and save only the aggregate; that's why HitsStatsDO has avgPercent and hitCount. GAE instances are stateless, so I had to use a dynamic backend instance.
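The two fields are enough because a running mean can be updated from the previous mean and the count alone. A minimal sketch of that update, assuming a stand-in class (HourlyStat and addHit are hypothetical names, not the real HitsStatsDO):

```java
// Hypothetical stand-in for the two aggregate fields of HitsStatsDO.
final class HourlyStat {
    double avgPercent;
    long hitCount;

    // Folds one new PercentHit value into the running mean.
    void addHit(double percentHit) {
        hitCount++;
        // Incremental mean: newAvg = oldAvg + (x - oldAvg) / n
        avgPercent += (percentHit - avgPercent) / hitCount;
    }
}
```

For example, adding hits of 50 and then 100 leaves avgPercent at 75 and hitCount at 2, without keeping the individual values.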
On the backend I have something like this:
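// Contains stats for one hour
private class Shard
{
    ReadWriteLock lock = new ReentrantReadWriteLock();
    Map<Integer, HitsStatsDO> map = new HashMap<Integer, HitsStatsDO>(); // Key is target ID

    // Writes the collected stats to Datastore (body omitted)
    public void saveToDatastore() { /* ... */ }

    // Merges one user's hits (target ID -> PercentHit) into the map (body omitted)
    public void updateStats(Long startDate, Map<Integer, Double> hits) { /* ... */ }
}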
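For illustration, the locking and merging inside updateStats could look roughly like the sketch below. It is a simplified assumption, not the actual implementation: ShardSketch and Totals are hypothetical names, Totals stands in for HitsStatsDO (it keeps a sum instead of an average), and the startDate parameter is dropped for brevity.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified version of Shard.
final class ShardSketch {
    // Stand-in for HitsStatsDO: running sum and count for one target ID.
    static final class Totals {
        double sumPercent;
        long hitCount;

        double average() {
            return hitCount == 0 ? 0 : sumPercent / hitCount;
        }
    }

    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<Integer, Totals> map = new HashMap<Integer, Totals>(); // key is target ID

    // Merges one respondent's hits (target ID -> PercentHit) into this hour's totals.
    void updateStats(Map<Integer, Double> hits) {
        lock.writeLock().lock();
        try {
            for (Map.Entry<Integer, Double> e : hits.entrySet()) {
                Totals t = map.get(e.getKey());
                if (t == null) {
                    t = new Totals();
                    map.put(e.getKey(), t);
                }
                t.sumPercent += e.getValue();
                t.hitCount++;
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Reads the current average for one target, or 0 if it has no hits yet.
    double average(int targetId) {
        lock.readLock().lock();
        try {
            Totals t = map.get(targetId);
            return t == null ? 0 : t.average();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

The write lock is held for the whole batch so that a concurrent dump to Datastore, taken under the read lock, never sees a half-applied list of hits.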
and a map holding the shard for the current hour and the shard for the previous hour (which doesn't stay there for long):
private HashMap<Long, Shard> shards = new HashMap<Long, Shard>(); // Key is HitsStatsDO.startDate
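The key for that map could be derived from the submission time as in the sketch below. It rests on an assumption the post does not state: that startDate is an epoch timestamp in milliseconds, truncated to the start of the hour. HourBucket and startOfHour are hypothetical names.

```java
// Hypothetical helper; assumes startDate is epoch milliseconds truncated to the hour.
final class HourBucket {
    static final long HOUR_MILLIS = 60L * 60 * 1000;

    // Start of the hour that contains the given timestamp.
    // Assumes a non-negative epoch timestamp in milliseconds.
    static long startOfHour(long timestampMillis) {
        return timestampMillis - (timestampMillis % HOUR_MILLIS);
    }
}
```

Every submission within the same hour then maps to the same key, so shards.get(HourBucket.startOfHour(now)) would return the shard to update.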
So once per hour I dump the Shard for the previous hour to Datastore.
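With hourly records stored this way, the aggregated % for a time frame can be computed at read time as a hit-weighted average of the hourly averages. A minimal sketch, assuming hypothetical names: HourRecord only mirrors the avgPercent and hitCount fields of HitsStatsDO, and TimeframeStats is made up for the example.

```java
import java.util.List;

// Hypothetical, simplified view of one stored hourly record for a single target ID.
final class HourRecord {
    final double avgPercent;
    final long hitCount;

    HourRecord(double avgPercent, long hitCount) {
        this.avgPercent = avgPercent;
        this.hitCount = hitCount;
    }
}

final class TimeframeStats {
    // Weighted mean of the hourly averages; each hour is weighted by its hit count.
    // Returns 0 when there were no hits in the time frame.
    static double averagePercent(List<HourRecord> hours) {
        double weightedSum = 0;
        long totalHits = 0;
        for (HourRecord h : hours) {
            weightedSum += h.avgPercent * h.hitCount;
            totalHits += h.hitCount;
        }
        return totalHits == 0 ? 0 : weightedSum / totalHits;
    }
}
```

A plain average of the hourly averages would be wrong here, because a quiet hour with one hit would count as much as a busy hour with thousands.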
Plus I have a class LifetimeStats which keeps a Map<Integer, HitsStatsDO> in memcache, where the map key is the target ID.
Also, in my backend's shutdown hook I dump the stats for the unfinished hour to Datastore.
There is only one major issue here: I have only ONE backend instance :) That raises the following questions, on which I'd like to hear your opinion:
Can I do this without using a backend instance?
What if one instance is not enough?
How can I split data between multiple dynamic backend instances? It's hard because I don't know how many there are, since Google creates new ones as load increases.
I know I can launch an exact number of resident backend instances. But how many? 2, 5, 10? What if I have no load at all for a week? Constantly running 10 backend instances is too expensive.
What do I do with data from clients while the backend instance is dead or restarting?
Thank you very much in advance for your thoughts.