One of the projects I'm working on is an online survey engine. It's my first big commercial project on Google App Engine.
I need your advice on how to collect stats and record them efficiently in the Datastore without bankrupting me. The initial requirements are:
After a user finishes a survey, the client sends a list of pairs [ID (int) + PercentHit (double)]. This list shows how closely that user's answers match the predefined answers of reference answerers (identified by their IDs). I call them "target IDs". (A minimal payload sketch follows this list.)
The creator of the survey wants to see the aggregated % for given IDs for the last hour, for a particular timeframe, or from the beginning of the survey.
Some surveys may have thousands of target/reference answerers.
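To make that concrete, the payload is essentially a map from target ID to percent hit. A purely illustrative sketch - the class and field names below are made up, only the shape matches what the client actually sends:

import java.io.Serializable;
import java.util.Map;

// Hypothetical DTO for one user's finished survey; names are illustrative.
public class SurveyResultPayload implements Serializable
{
    private long surveyId; // assumed field, for context only
    // Key: target (reference answerer) ID; value: how closely this user's
    // answers match that target's predefined answers.
    private Map<Integer, Double> percentHitByTargetId;
}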
So I created this entity:
public class HitsStatsDO implements Serializable
{
    @Id
    transient private Long id;
    transient private Long version = (long) 0;
    transient private Long startDate;        // start of the hour this row aggregates
    @Parent transient private Key parent;    // fake parent which contains target id
    @Transient int targetId;

    private double avgPercent;               // average PercentHit over the hour
    private long hitCount;                   // number of users aggregated into avgPercent
}
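To show how I key these entities - one aggregate row per (targetId, hour) - here is a hedged sketch of a constructor that could live inside HitsStatsDO. The "TargetStatsParent" kind name is an assumption, not my actual code; KeyFactory is com.google.appengine.api.datastore.KeyFactory:

// Sketch only: one hourly aggregate per target, grouped under a fake parent key.
public HitsStatsDO(int targetId, long hourStartMillis)
{
    this.parent = KeyFactory.createKey("TargetStatsParent", targetId); // one entity group per target
    this.targetId = targetId;
    this.startDate = hourStartMillis; // truncated to the hour boundary
    this.avgPercent = 0.0;
    this.hitCount = 0;
}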
But writing a HitsStatsDO for each target from each user would generate a huge amount of data. For instance, I had a survey with 3000 targets which was answered by ~4 million people within one week, with 300K people taking the survey on the first day. Even if we assume they answered evenly over 24 hours, that's about 300,000 × 3,000 / 86,400 ≈ 10,400 writes/second. Obviously that hits the Datastore's concurrent write limit.
So I decided to collect data for one hour and then save it, which is why HitsStatsDO has avgPercent and hitCount. GAE instances are stateless, so I had to use a dynamic backend instance.
There I have something like this:
// Contains stats for one hour
private class Shard
{
    ReadWriteLock lock = new ReentrantReadWriteLock();
    Map<Integer, HitsStatsDO> map = new HashMap<Integer, HitsStatsDO>(); // Key is target ID

    public void saveToDatastore();                                       // dumps this hour's aggregates
    public void updateStats(Long startDate, Map<Integer, Double> hits);  // merges one user's results
}
and a map holding the shard for the current hour and the shard for the previous hour (which doesn't stay there for long):
private HashMap<Long, Shard> shards = new HashMap<Long, Shard>(); // Key is HitsStatsDO.startDate
So once per hour I dump the Shard for the previous hour to the Datastore.
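To make the mechanism concrete, here is roughly what the two Shard methods look like. This is a sketch, not my exact code: it assumes direct field access to HitsStatsDO (my real code goes through accessors), the constructor sketched earlier, and Objectify 4's static ofy() (import static com.googlecode.objectify.ObjectifyService.ofy) - adjust to whatever persistence layer you use:

// Merge one user's results into the in-memory aggregates for this hour.
public void updateStats(Long startDate, Map<Integer, Double> hits)
{
    lock.writeLock().lock();
    try
    {
        for (Map.Entry<Integer, Double> hit : hits.entrySet())
        {
            HitsStatsDO stats = map.get(hit.getKey());
            if (stats == null)
            {
                stats = new HitsStatsDO(hit.getKey(), startDate); // constructor sketched above
                map.put(hit.getKey(), stats);
            }
            // Incremental mean: newAvg = oldAvg + (x - oldAvg) / (n + 1)
            stats.avgPercent += (hit.getValue() - stats.avgPercent) / (stats.hitCount + 1);
            stats.hitCount++;
        }
    }
    finally
    {
        lock.writeLock().unlock();
    }
}

// Called once per hour (and from the shutdown hook) to persist the hour's aggregates.
public void saveToDatastore()
{
    lock.readLock().lock();
    try
    {
        ofy().save().entities(map.values()).now(); // Objectify 4 style
    }
    finally
    {
        lock.readLock().unlock();
    }
}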
Plus I have a class LifetimeStats which keeps a Map<Integer, HitsStatsDO> in memcache, where the map key is the target ID.
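In code, that is roughly the following; the key scheme ("lifetime-stats-" + surveyId) is an assumption for illustration, the API is App Engine's low-level MemcacheService:

import java.util.HashMap;
import java.util.Map;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

// Hedged sketch of the lifetime aggregates held in memcache.
public class LifetimeStats
{
    private final MemcacheService cache = MemcacheServiceFactory.getMemcacheService();

    @SuppressWarnings("unchecked")
    public Map<Integer, HitsStatsDO> load(long surveyId)
    {
        Map<Integer, HitsStatsDO> lifetime =
                (Map<Integer, HitsStatsDO>) cache.get("lifetime-stats-" + surveyId);
        return lifetime != null ? lifetime : new HashMap<Integer, HitsStatsDO>();
    }

    public void store(long surveyId, Map<Integer, HitsStatsDO> lifetime)
    {
        cache.put("lifetime-stats-" + surveyId, lifetime);
    }
}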
Also, in my backend's shutdown hook method I dump the stats for the unfinished hour to the Datastore.
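That hook is registered through the backends LifecycleManager (com.google.appengine.api.LifecycleManager); a hedged sketch, assuming a shards map like the one above and that iterating it at shutdown is acceptable:

// Registered once when the backend starts (e.g. from a ServletContextListener).
public void registerShutdownHook()
{
    LifecycleManager.getInstance().setShutdownHook(new LifecycleManager.ShutdownHook()
    {
        public void shutdown()
        {
            for (Shard shard : shards.values())
            {
                shard.saveToDatastore(); // flush the unfinished hour before the instance dies
            }
            LifecycleManager.getInstance().interruptAllRequests();
        }
    });
}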
There is only one major issue here - I have only ONE backend instance :) It raises the following questions, on which I'd like to hear your opinion:
Can I do this without using a backend instance?
What if one instance is not enough?
How can I split data between multiple dynamic backend instances? It's hard because I don't know how many I have - Google creates new ones as the load increases.
I know I can launch an exact number of resident backend instances. But how many? 2, 5, 10? What if there is no load at all for a week? Constantly running 10 backend instances is too expensive.
What do I do with data from clients while a backend instance is dead/restarting?