make-like build tools for data?
- by miku
Make is a standard tool for building software, but it decides whether a target needs to be regenerated only by comparing file modification times.
Are there any proven, preferably small tools that handle builds not for software but for data? Something that regenerates targets not only on mod times but on certain other properties (e.g. completeness). (Or alternatively some paper that describes such a tool.)
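To make the distinction concrete, here is a minimal Python sketch of the two kinds of check. The first function mirrors make's modification-time rule; the second is a hypothetical "completeness" rule (a target counts as complete once it has an expected number of lines). The function names and the line-count criterion are only assumptions for illustration, not part of any existing tool.

```python
import os


def stale_by_mtime(target, sources):
    """make-style rule: rebuild if the target is missing or older than any source."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(s) > target_mtime for s in sources)


def stale_by_completeness(target, expected_lines):
    """Hypothetical data rule: rebuild if the target is missing or has too few lines."""
    if not os.path.exists(target):
        return True
    with open(target, encoding="utf-8") as fh:
        return sum(1 for _ in fh) < expected_lines
```

A target can be newer than all of its sources and still be incomplete (e.g. after an interrupted run), which is exactly the case the first check misses.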
As an illustration, I'd like to automate the following process:
get data (e.g. a tarball) from some regularly updated source
copy somewhere if it's not there (based e.g. on some filename-scheme)
convert the files to a different format (but only if there aren't successfully converted ones there already, e.g. from a previous attempt; this needs a custom comparison routine)
for each file, find a certain data element and fetch some additional file from, say, a URL, but only if that hasn't been downloaded yet (decide based on existence of the file and file "freshness")
finally compute something (e.g. a word count for something identifiable) and store it in the database, but only if the DB does not have an entry for that exact ID yet
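Two of the steps above could be sketched roughly as follows, each with its own "is this already done?" check. This is only a sketch under assumptions: the function names, the `counts` table layout and the use of SQLite are made up for illustration, and the real checks would be more involved.

```python
import os
import shutil
import sqlite3


def copy_if_missing(src, dest_dir):
    """Copy step: skip the copy when a file with the same name is already there."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(src))
    if not os.path.exists(dest):
        shutil.copy2(src, dest)
    return dest


def store_word_count(db, doc_id, path):
    """Final step: insert a word count unless the DB already has this ID.

    Returns True if a row was inserted, False if the ID was already present.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS counts (id TEXT PRIMARY KEY, words INTEGER)"
    )
    if db.execute("SELECT 1 FROM counts WHERE id = ?", (doc_id,)).fetchone():
        return False
    with open(path, encoding="utf-8") as fh:
        words = len(fh.read().split())
    db.execute("INSERT INTO counts (id, words) VALUES (?, ?)", (doc_id, words))
    db.commit()
    return True
```

Running either function twice with the same arguments does the work only once, which is the behaviour I'd like a tool to give me for every stage.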
Observations:
there are different stages
each stage is usually simple to compute or implement in isolation
each stage may be simple, but the data volume may be large
each stage may produce a few errors
each stage may have different signals for when (re)processing is needed
Requirements:
builds should be interruptible and idempotent (== robust)
when interrupted, already processed objects should be reused to speed up the next run
data paths should be easy to adjust (simple syntax, nothing new to learn, internal dsl would be ok)
some form of dependency graph that describes the process would be nice for later visualizations
should leverage existing programs, if possible
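As a rough illustration of the interruptibility and dependency-graph requirements, here is a minimal Python sketch (standard library only, Python 3.9+ for `graphlib`). The stage names in `STAGES` are hypothetical placeholders for the process described earlier. Outputs are written to a temporary file and renamed into place, so an interrupted run should never leave a half-written target that a later run mistakes for a finished one.

```python
import os
import tempfile
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical stage names; each stage maps to the stages it depends on.
STAGES = {
    "fetch": set(),
    "copy": {"fetch"},
    "convert": {"copy"},
    "enrich": {"convert"},
    "count": {"enrich"},
}


def run_order(stages):
    """Return the stages in an order that respects their dependencies."""
    return list(TopologicalSorter(stages).static_order())


def to_dot(stages):
    """Render the graph as Graphviz DOT text for later visualization."""
    lines = ["digraph pipeline {"]
    for stage, deps in stages.items():
        for dep in sorted(deps):
            lines.append(f'  "{dep}" -> "{stage}";')
    lines.append("}")
    return "\n".join(lines)


def write_atomically(path, data):
    """Write to a temp file in the target's directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(data)
        os.replace(tmp, path)  # the target appears only once fully written
    except BaseException:
        # Clean up the partial temp file on errors and on Ctrl-C alike.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

With targets written this way, "does the target exist?" becomes a usable signal for "this object was already processed", and the same graph description drives both execution order and the picture.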
I've done some research on make alternatives like rake, and have worked a lot with ant and maven in the past. All these tools naturally focus on code and software builds, not on data builds. The system we have in place now for a task similar to the above is pretty much just shell scripts, which are compact (and are OK glue for a variety of programs written in other languages), so I wonder if worse is better?