How to best design a date/geographic proximity query on GAE?
- by Dane
Hi all,
I'm building a directory for finding athletic tournaments on GAE with
web2py and a Flex front end. The user selects a location, a radius, and a maximum
date from a set of choices. I have a basic version of this query implemented, but it's
inefficient and slow. One way I know I can improve it is by condensing
the many individual queries I'm using to assemble the objects into
bulk queries. I just learned that was possible. But I'm also thinking about a more extensive redesign that utilizes memcache.
The main problem is that I can't query the datastore by location
because GAE won't allow multiple numerical comparison statements
(<,<=,=,) in one query. I'm already using one for date, and I'd need
TWO to check both latitude and longitude, so it's a no go. Currently,
my algorithm looks like this:
1.) Query by date and select
2.) Use destination function from geopy's distance module to find the
max and min latitude and longitudes for supplied distance
3.) Loop through results and remove all with lat/lng outside max/min
4.) Loop through again and use distance function to check exact
distance, because step 2 will include some areas outside the radius.
Remove results outside supplied distance (is this 2/3/4 combination
inefficent?)
5.) Assemble many-to-many lists and attach to objects (this is where I
need to switch to bulk operations)
6.) Return to client
Here's my plan for using memcache.. let me know if I'm way out in left
field on this as I have no prior experience with memcache or server
caching in general.
-Keep a list in the cache filled with "geo objects" that represent all
my data. These have five properties: latitude, longitude, event_id,
event_type (in anticipation of expanding beyond tournaments), and
start_date. This list will be sorted by date.
-Also keep a dict of pointers in the cache which represent the start
and end indices in the cache for all the date ranges my app uses (next
week, 2 weeks, month, 3 months, 6 months, year, 2 years).
-Have a scheduled task that updates the pointers daily at 12am.
-Add new inserts to the cache as well as the datastore; update
pointers.
Using this design, the algorithm would now look like:
1.) Use pointers to slice off appropriate chunk of list based on
supplied date.
2-4.) Same as above algorithm, except with geo objects
5.) Use bulk operation to select full tournaments using remaining geo
objects' event_ids
6.) Assemble many-to-manys
7.) Return to client
Thoughts on this approach? Many thanks for reading and any advice you
can give.
-Dane