How can I gather client's data on Google App Engine without using Datastore/Backend Instances too much?

Posted by ruslan on Programmers
Published on 2011-11-20T01:31:34Z

One of the projects I'm working on is an online survey engine. It's my first big commercial project on Google App Engine.

I need your advice on how to collect stats and record them efficiently in Datastore without bankrupting me. The initial requirements are:

  • After a user finishes a survey, the client sends a list of pairs [ID (int) + PercentHit (double)]. This list shows how closely this user's answers match the predefined answers of reference answerers (which are identified by IDs). I call them "target IDs".
  • The creator of the survey wants to see the aggregated % for given IDs for the last hour, for a particular timeframe, or since the beginning of the survey.
  • Some surveys may have thousands of target/reference answerers.

So I created this entity:

public class HitsStatsDO implements Serializable
{
    @Id
    transient private Long id;
    transient private Long version = 0L;

    transient private Long startDate;   // start of the hour this record covers

    @Parent transient private Key parent;   // fake parent which contains target id
    @Transient int targetId;

    private double avgPercent;   // average PercentHit over the hour
    private long hitCount;       // number of hits folded into avgPercent
}

But writing a HitsStatsDO for each target from each user would produce a lot of data. For instance, I had a survey with 3000 targets which was answered by ~4 million people within one week, with 300K people taking the survey on the first day. Even if we assume they answered it evenly over 24 hours, that gives ~10,400 writes/second. Obviously that hits the concurrent write limit of Datastore.
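
Working the estimate through with the figures quoted above (300K respondents on the first day, 3000 targets each, spread evenly over 24 hours) gives roughly 10,400 writes per second. A minimal sketch of that arithmetic follows; the class and method names are made up for illustration.

```java
// Back-of-envelope write rate, using only the figures quoted in the post.
// WriteRateEstimate and writesPerSecond are hypothetical names.
final class WriteRateEstimate {

    // One entity write per (respondent, target) pair, spread evenly over the period.
    static double writesPerSecond(long respondents, long targetsPerSurvey, long periodSeconds) {
        return (double) respondents * targetsPerSurvey / periodSeconds;
    }

    public static void main(String[] args) {
        long firstDayRespondents = 300_000L;
        long targets = 3_000L;
        long secondsPerDay = 24L * 60 * 60;   // 86,400

        // 900,000,000 writes / 86,400 s
        System.out.printf("%.0f writes/second%n",
                writesPerSecond(firstDayRespondents, targets, secondsPerDay));
        // prints "10417 writes/second"
    }
}
```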

So I decided to collect data for one hour and save only the aggregate; that's why HitsStatsDO has avgPercent and hitCount. GAE instances are stateless, so I had to use a dynamic backend instance.

There I have something like this:

// Contains stats for one hour
private class Shard
{
    ReadWriteLock lock = new ReentrantReadWriteLock();
    Map<Integer, HitsStatsDO> map = new HashMap<Integer, HitsStatsDO>(); // Key is target ID

    public void saveToDatastore() { /* body omitted */ }
    public void updateStats(Long startDate, Map<Integer, Double> hits) { /* body omitted */ }
}

and a map holding the shards for the current hour and the previous hour (which doesn't stay there for long):

private HashMap<Long, Shard> shards = new HashMap<Long, Shard>();   // Key is HitsStatsDO.startDate

So once per hour I dump the Shard for the previous hour to Datastore.
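
The hourly aggregation described above might look roughly like the sketch below. It is not the actual Shard code from the backend: TargetStats is a hypothetical plain-Java stand-in for HitsStatsDO, HourlyShard and hourStart are made-up names, persistence is left out entirely, and it assumes Java 8 for computeIfAbsent and Math.floorMod. It only shows how avgPercent and hitCount can be kept up to date without storing individual hits.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for HitsStatsDO: a running mean plus a sample count.
final class TargetStats {
    double avgPercent;
    long hitCount;

    // Incremental mean: fold in one new sample without keeping the samples.
    void add(double percentHit) {
        hitCount++;
        avgPercent += (percentHit - avgPercent) / hitCount;
    }
}

// Sketch of a one-hour shard; saving to Datastore is deliberately not shown.
final class HourlyShard {
    static final long HOUR_MILLIS = 60L * 60L * 1000L;

    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<Integer, TargetStats> byTargetId = new HashMap<>(); // key is target ID

    // Start of the hour containing the timestamp; usable as the key of the shards map.
    static long hourStart(long epochMillis) {
        return epochMillis - Math.floorMod(epochMillis, HOUR_MILLIS);
    }

    // Fold one respondent's [target ID -> PercentHit] pairs into the shard.
    void updateStats(Map<Integer, Double> hits) {
        lock.writeLock().lock();
        try {
            for (Map.Entry<Integer, Double> hit : hits.entrySet()) {
                byTargetId.computeIfAbsent(hit.getKey(), id -> new TargetStats())
                          .add(hit.getValue());
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Average for one target, or 0.0 if the target has no hits this hour.
    double avgPercent(int targetId) {
        lock.readLock().lock();
        try {
            TargetStats s = byTargetId.get(targetId);
            return s == null ? 0.0 : s.avgPercent;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Number of hits recorded for one target this hour.
    long hitCount(int targetId) {
        lock.readLock().lock();
        try {
            TargetStats s = byTargetId.get(targetId);
            return s == null ? 0L : s.hitCount;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

With this shape, one Datastore write per target per hour is enough regardless of how many respondents arrived in that hour.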

In addition, I have a class LifetimeStats which keeps a Map<Integer, HitsStatsDO> in memcache, where the map key is the target ID.
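
Folding an hour's aggregates into the lifetime map comes down to a count-weighted mean per target ID. The sketch below illustrates that step only; Aggregate and LifetimeMerge are hypothetical names, a plain in-memory Map stands in for the memcache-backed map, and it assumes Java 8 for Map.merge.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical immutable aggregate carrying the same two fields as HitsStatsDO.
final class Aggregate {
    final double avgPercent;
    final long hitCount;

    Aggregate(double avgPercent, long hitCount) {
        this.avgPercent = avgPercent;
        this.hitCount = hitCount;
    }

    // Count-weighted mean of two aggregates for the same target ID.
    Aggregate merge(Aggregate other) {
        long total = hitCount + other.hitCount;
        if (total == 0) {
            return this;   // nothing recorded on either side
        }
        double avg = (avgPercent * hitCount + other.avgPercent * other.hitCount) / total;
        return new Aggregate(avg, total);
    }
}

final class LifetimeMerge {
    // Folds one hour's per-target aggregates into the lifetime map, in place.
    static void foldHourInto(Map<Integer, Aggregate> lifetime, Map<Integer, Aggregate> hour) {
        for (Map.Entry<Integer, Aggregate> e : hour.entrySet()) {
            // New targets are inserted as-is; existing ones are combined by weighted mean.
            lifetime.merge(e.getKey(), e.getValue(), Aggregate::merge);
        }
    }
}
```

For example, a lifetime entry of 50% over 2 hits merged with an hourly entry of 80% over 1 hit gives 60% over 3 hits.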

Also, in my backend's shutdown hook method I dump the stats for the unfinished hour to Datastore.

There is only one major issue here: I have only ONE backend instance :) This raises the following questions, on which I'd like to hear your opinion:

  • Can I do this without using a backend instance?
  • What if one instance is not enough?
  • How can I split the data between multiple dynamic backend instances? It's hard because I don't know how many I have, since Google creates new ones as load increases.
  • I know I can launch an exact number of resident backend instances. But how many? 2, 5, 10? What if I have no load at all for a week? Constantly running 10 backend instances is too expensive.
  • What do I do with data from clients while the backend instance is dead or restarting?
