How do you efficiently do bulk index lookups?

Posted by Liron Shapira on Stack Overflow, 2010-06-16

I have these entity kinds:

  • Molecule
  • Atom
  • MoleculeAtom

Given a list(molecule_ids) whose length is in the hundreds, I need to get a dict of the form {molecule_id: list(atom_ids)}. Likewise, given a list(atom_ids) whose length is in the hundreds, I need to get a dict of the form {atom_id: list(molecule_ids)}.

Both of these bulk lookups need to happen really fast. Right now I'm doing something like:

from google.appengine.ext import db

# One datastore query per molecule: hundreds of serial round trips.
atom_ids_by_molecule_id = {}
for molecule_id in molecule_ids:
    moleculeatoms = MoleculeAtom.all().filter(
        'molecule =', db.Key.from_path('molecule', molecule_id)).fetch(1000)
    # get_value_for_datastore() reads the raw reference key without
    # dereferencing (fetching) each Atom entity.
    atom_ids_by_molecule_id[molecule_id] = [
        MoleculeAtom.atom.get_value_for_datastore(ma).id() for ma in moleculeatoms
    ]

Like I said, len(molecule_ids) is in the hundreds. I need to do this kind of bulk index lookup on almost every request, and it needs to be FAST; right now it's too slow.

Ideas:

  • Will using a Molecule.atoms ListProperty do what I need? Keep in mind that I'm storing additional data on the MoleculeAtom entity, and that the molecule->atom and atom->molecule lookup directions are equally important to me. (A sketch of this idea follows the list.)

  • Caching? I tried memcaching lists of atom IDs keyed by molecule ID, but I have tons of atoms and molecules, and the cache can't fit all of them. (A batched-caching sketch also follows the list.)

  • How about denormalizing the data by creating a new entity kind whose key name is a molecule ID and whose value is a list of atom IDs? The idea is that calling db.get on 500 keys is probably faster than looping through 500 filtered queries, right? (See the third sketch below.)
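
Here is one way the first idea could look. This is a minimal sketch, not the poster's actual models: the atom_keys property name is invented, the kind names assume Molecule/Atom, and MoleculeAtom would stay around to hold the extra per-edge data.

from google.appengine.ext import db

class Molecule(db.Model):
    # Each element of a ListProperty is indexed individually, so one
    # property serves both directions: molecule->atoms by reading the
    # value, atom->molecules by an equality filter on the list.
    atom_keys = db.ListProperty(db.Key)

# molecule -> atom IDs: a single batch db.get, no per-molecule queries.
molecule_keys = [db.Key.from_path('Molecule', mid) for mid in molecule_ids]
atom_ids_by_molecule_id = dict(
    (m.key().id(), [k.id() for k in m.atom_keys])
    for m in db.get(molecule_keys) if m is not None)

# atom -> molecule IDs: an equality filter matches any element of the list.
atom_key = db.Key.from_path('Atom', atom_id)
molecule_ids_for_atom = [m.key().id() for m in
                         Molecule.all().filter('atom_keys =', atom_key).fetch(1000)]

The catch: the reverse direction is still one query per atom, and a single entity can only carry a few thousand indexed list values, so very large molecules wouldn't fit.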
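For the caching idea, a cache that can't hold everything can still pay off if the lookups are batched: one memcache.get_multi for the whole ID list, then the original query only for the misses. A sketch (the atoms_of: key prefix is made up):

from google.appengine.api import memcache
from google.appengine.ext import db

def get_atom_ids_by_molecule_id(molecule_ids):
    # One memcache round trip for all molecule IDs at once.
    cached = memcache.get_multi([str(m) for m in molecule_ids],
                                key_prefix='atoms_of:')
    result = dict((int(k), v) for k, v in cached.iteritems())
    to_cache = {}
    for molecule_id in molecule_ids:
        if molecule_id in result:
            continue
        # Cache miss: fall back to the per-molecule query from above.
        moleculeatoms = MoleculeAtom.all().filter(
            'molecule =', db.Key.from_path('molecule', molecule_id)).fetch(1000)
        atom_ids = [MoleculeAtom.atom.get_value_for_datastore(ma).id()
                    for ma in moleculeatoms]
        result[molecule_id] = atom_ids
        to_cache[str(molecule_id)] = atom_ids
    if to_cache:
        # One memcache round trip to store all the misses.
        memcache.set_multi(to_cache, key_prefix='atoms_of:')
    return result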
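And a sketch of the denormalized index from the third idea. db.get on a list of keys is a single batch RPC, which is exactly the win being asked about; the MoleculeAtomIndex kind name and atom_ids property are invented for illustration:

from google.appengine.ext import db

class MoleculeAtomIndex(db.Model):
    # key_name is str(molecule_id); the entity is nothing but the atom list.
    atom_ids = db.ListProperty(int)

# One batch round trip for all molecules. db.get returns results in input
# order, with None for any key that has no entity.
index_keys = [db.Key.from_path('MoleculeAtomIndex', str(mid))
              for mid in molecule_ids]
atom_ids_by_molecule_id = dict(
    (mid, idx.atom_ids if idx is not None else [])
    for mid, idx in zip(molecule_ids, db.get(index_keys)))

The mirror-image kind (keyed by atom ID, holding molecule IDs) covers the atom->molecule direction; the cost is keeping both index entities in sync whenever a MoleculeAtom is written.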
