Caching factory design
- by max
I have a factory class XFactory that creates objects of class X. Instances of X are very large, so the main purpose of the factory is to cache them, as transparently to the client code as possible. Objects of class X are immutable, so the following code seems reasonable:
# module xfactory.py
import x
class XFactory:
_registry = {}
def get_x(self, arg1, arg2, use_cache = True):
if use_cache:
hash_id = hash((arg1, arg2))
if hash_id in _registry:
return _registry[hash_id]
obj = x.X(arg1, arg2)
_registry[hash_id] = obj
return obj
# module x.py
class X:
# ...
Is it a good pattern? (I know it's not the actual Factory Pattern.) Is there anything I should change?
Now, I find that sometimes I want to cache X objects to disk. I'll use pickle for that purpose, and store as values in the _registry the filenames of the pickled objects instead of references to the objects. Of course, _registry itself would have to be stored persistently (perhaps in a pickle file of its own, in a text file, in a database, or simply by giving pickle files the filenames that contain hash_id).
Except now the validity of the cached object depends not only on the parameters passed to get_x(), but also on the version of the code that created these objects.
Strictly speaking, even a memory-cached object could become invalid if someone modifies x.py or any of its dependencies, and reloads it while the program is running. So far I ignored this danger since it seems unlikely for my application. But I certainly cannot ignore it when my objects are cached to persistent storage.
What can I do? I suppose I could make the hash_id more robust by calculating hash of a tuple that contains arguments arg1 and arg2, as well as the filename and last modified date for x.py and every module and data file that it (recursively) depends on. To help delete cache files that won't ever be useful again, I'd add to the _registry the unhashed representation of the modified dates for each record.
But even this solution isn't 100% safe since theoretically someone might load a module dynamically, and I wouldn't know about it from statically analyzing the source code. If I go all out and assume every file in the project is a dependency, the mechanism will still break if some module grabs data from an external website, etc.).
In addition, the frequency of changes in x.py and its dependencies is quite high, leading to heavy cache invalidation.
Thus, I figured I might as well give up some safety, and only invalidate the cache only when there is an obvious mismatch. This means that class X would have a class-level cache validation identifier that should be changed whenever the developer believes a change happened that should invalidate the cache. (With multiple developers, a separate invalidation identifier is required for each.) This identifier is hashed along with arg1 and arg2 and becomes part of the hash keys stored in _registry.
Since developers may forget to update the validation identifier or not realize that they invalidated existing cache, it would seem better to add another validation mechanism: class X can have a method that returns all the known "traits" of X. For instance, if X is a table, I might add the names of all the columns. The hash calculation will include the traits as well.
I can write this code, but I am afraid that I'm missing something important; and I'm also wondering if perhaps there's a framework or package that can do all of this stuff already. Ideally, I'd like to combine in-memory and disk-based caching.