I tried writing a "perfect" backup program (described below), but ran
into problems (also below). Is there an efficient, working version
of this?
Assumptions: you're backing up from 'local', which you own and which
has limited disk space, to 'remote', which has infinite disk space and
belongs to someone else, so you need encryption. Network bandwidth
is finite.
'local' keeps a db of backed-up files w/ this data for each file:
  - filename, including full path
  - file's last modified time (mtime)
  - sha1sum of file's unencrypted contents
  - sha1sum of file's encrypted contents
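A minimal sketch of what such a db could look like, assuming SQLite (my pick for illustration only). The table name `files`, the column names, and the helper `open_backup_db` are all made up, and storing mtime as an integer (e.g. nanoseconds since the epoch) is an assumption too:

```python
import sqlite3

def open_backup_db(path=":memory:"):
    """Open (or create) the 'local' db with one row per backed-up file."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS files (
            path       TEXT    NOT NULL,  -- filename, including full path
            mtime      INTEGER NOT NULL,  -- last modified time (units assumed)
            plain_sha1 TEXT    NOT NULL,  -- sha1sum of unencrypted contents
            enc_sha1   TEXT    NOT NULL,  -- sha1sum of encrypted contents
            PRIMARY KEY (path, mtime)
        )
        """
    )
    # Index so "do we already have these contents?" is a cheap lookup.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS files_plain_sha1 ON files (plain_sha1)"
    )
    conn.commit()
    return conn
```

The (path, mtime) primary key is what makes the "already backed up?" check a single indexed lookup.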
Given a list of files to back up (some perhaps already backed up),
the program runs 'find' and gets the full path/mtime for each file
(this is fairly efficient; conversely, computing the sha1sum of each
file would NOT be efficient).
The program discards files whose filename and mtime are in 'local' db.
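Roughly, the scan-and-discard steps could look like the sketch below. It uses Python's os.walk in place of 'find' (a substitution, not what the program actually runs), and it assumes the known (path, mtime) pairs have already been pulled out of 'local' db into a set called `known`. The function names are hypothetical:

```python
import os
import stat

def scan(root):
    """Yield (full_path, mtime_ns) for every regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full, follow_symlinks=False)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                yield full, st.st_mtime_ns

def not_yet_backed_up(candidates, known):
    """Drop every (path, mtime) pair that is already recorded in `known`."""
    return [pair for pair in candidates if pair not in known]
```

Only the stat call touches each file here; no file contents are read at this stage.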
The program now computes the sha1sum of the (unencrypted) contents
of each remaining file.
If the sha1sum matches one in 'local' db, we create a special entry
in 'local' db that points this file/mtime to the file/mtime of the
existing entry. Effectively, we're saying "we have a backup of this
file's contents, but under another filename, so no need to back it
up again".
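A sketch of the hashing and "same contents under another name" check, assuming the known sha1sums have been loaded into a dict `known_hashes` that maps each unencrypted sha1sum to the (path, mtime) of its existing entry. Both function names are hypothetical:

```python
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex sha1sum of a file, reading it in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def split_by_known_contents(candidates, known_hashes):
    """Separate files whose contents are already backed up from new ones.

    Returns (aliases, to_upload):
      aliases   -- ((path, mtime), existing (path, mtime)) pairs
      to_upload -- (path, mtime, plain_sha1) triples still needing backup
    """
    aliases, to_upload = [], []
    for path, mtime in candidates:
        digest = sha1_of_file(path)
        if digest in known_hashes:
            aliases.append(((path, mtime), known_hashes[digest]))
        else:
            to_upload.append((path, mtime, digest))
    return aliases, to_upload
```

Each alias pair would become one of the special "points to the existing entry" rows in 'local' db.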
For each remaining file, we encrypt the file, take the sha1sum of
the encrypted file's contents, and rsync the encrypted file to a
path on 'remote' derived from that sha1sum. Example: if the file's
encrypted sha1sum was da39a3ee5e6b4b0d3255bfef95601890afd80709, we'd
rsync it to
/some/path/da/39/a3/da39a3ee5e6b4b0d3255bfef95601890afd80709 on
'remote'.
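The path layout in that example (first three byte pairs of the hash as directories, then the full hash as the filename) can be sketched as follows. The function name `remote_path` is hypothetical, and the default base is just the placeholder from the example:

```python
import posixpath

def remote_path(enc_sha1, base="/some/path"):
    """Map an encrypted-content sha1sum to its location on 'remote'."""
    # The first three pairs of hex digits become nested directories.
    fanout = [enc_sha1[i:i + 2] for i in (0, 2, 4)]
    return posixpath.join(base, *fanout, enc_sha1)
```

posixpath is used rather than os.path so the result has forward slashes regardless of the OS on 'local'.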
Once the step above succeeds, we add the file to the 'local' db.
Note that we efficiently avoid computing sha1sums and encrypting
unless absolutely necessary.
Note: I don't specify encryption method: this would be user's choice.
The problems:
  - We must encrypt and back up 'local' db regularly. However, 'local'
    db grows quickly and rsync'ing encrypted files is inefficient, since
    a small change in 'local' db means a big change in the encrypted
    version of 'local' db.
  - We create a file on 'remote' for each file on 'local', which is
    ugly and excessive.
  - We query 'local' db frequently. Even w/ indexes, these queries are
    slow, since we're often making one query for each file. Would be
    nice to speed this up by batching queries or something.
  - Probably other problems that I've now forgotten.
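On batching the db queries: one possible approach, sketched here assuming SQLite and a hypothetical table `files` with `path` and `mtime` columns, is to load the whole scan into a temporary table and let a single join find the files that aren't in the db yet. I haven't measured whether this is actually faster for any given setup:

```python
import sqlite3

def filter_unknown(conn, candidates):
    """Return the (path, mtime) pairs from `candidates` not present in `files`.

    Replaces one SELECT per file with one bulk insert plus one join.
    """
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS scan (path TEXT, mtime INTEGER)"
    )
    conn.execute("DELETE FROM scan")  # clear leftovers from a previous run
    conn.executemany("INSERT INTO scan VALUES (?, ?)", candidates)
    return conn.execute(
        """
        SELECT s.path, s.mtime
        FROM scan AS s
        LEFT JOIN files AS f
               ON f.path = s.path AND f.mtime = s.mtime
        WHERE f.path IS NULL
        """
    ).fetchall()
```

If the db is small enough, reading every (path, mtime) pair into an in-memory set once per run would be an even simpler alternative.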