I have an hourly cron job that copies about 40GB of data from a source folder into a new folder with the hour appended on the end.
When it's done, the job prunes anything older than 24 hours. This data changes very often during work hours and is on a samba file share. Here's how the folder structure looks:
\server\Version.1
\server\Version.2
\server\Version.3
...
\server\Version.24
The contents of each new folder usually don't change very much compared to the last one, since this is an hourly job.
Now you might be thinking that I'm an idiot for dreaming this up. Truth is, I just found out about it. It's actually been in use for years and is incredibly simple: anyone could delete the ENTIRE 40GB share (imagine that dialog spooling up... deleting thousands and thousands of files), and it would actually be faster to restore by moving the latest copy back to the source than it took to delete.
Brilliant!
Now to top this off, I need to efficiently replicate this 960GB of "mostly similar" data to a remote server over a WAN link, with the replication happening as close to real-time as possible -- think hot spare, disaster recovery, etc.
My first thought was rsync.
Total failure.
Rsync sees the deletion of the folder that is 24 hours old and the addition of a new folder with 30GB of data to sync! I also looked at rdiff-backup and unison; they both appear to use similar algorithms and do not keep enough metadata to do this intelligently.
The best thing I can find "out of the box" for this is Windows Server "Distributed File System Replication", which uses "Remote Differential Compression". After reading the background information on how it works, it actually looks like exactly what I need.
Problem: Both servers are running Linux. D'oh!
One approach I'm looking at is this. Say it's 5AM and the cron job has just finished:
The new Version.5 folder arrives on the local server
SSH to the remote server and copy Version.4 to Version.5
Run rsync on the local server, pushing changes to the remote server. Rsync finally knows to do a differential copy between Version.4 and Version.5
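To make the idea concrete, here is a rough Python sketch of that seed-then-sync cycle. It only builds the two commands; nothing here is tested against my real setup. The host name and directory roots are hypothetical, and I'm assuming that Version.1 follows Version.24, that any stale copy of the target folder on the remote can be thrown away before seeding, and that the paths contain no spaces.

```python
import shlex


def build_replication_commands(hour, remote_host, local_root, remote_root):
    """Return (seed, sync) argument lists for one hourly cycle.

    remote_host, local_root and remote_root are hypothetical placeholders.
    """
    # Assumption: folders are Version.1 .. Version.24, so hour 1 follows hour 24.
    prev_hour = 24 if hour == 1 else hour - 1
    prev_dir = f"{remote_root}/Version.{prev_hour}"
    new_dir = f"{remote_root}/Version.{hour}"

    # Seed the new folder on the remote from the previous hour's copy.
    # The stale folder is removed first so that cp does not nest the copy
    # inside an existing directory.
    seed_script = (
        f"rm -rf {shlex.quote(new_dir)} && "
        f"cp -a {shlex.quote(prev_dir)} {shlex.quote(new_dir)}"
    )
    seed = ["ssh", remote_host, seed_script]

    # Push the local folder over the seeded copy. The trailing slashes make
    # rsync compare directory contents, so unchanged files are skipped and
    # changed files go through the delta algorithm.
    sync = [
        "rsync", "-a", "--delete",
        f"{local_root}/Version.{hour}/",
        f"{remote_host}:{new_dir}/",
    ]
    return seed, sync


if __name__ == "__main__":
    # Hypothetical host and paths, printed rather than executed.
    for cmd in build_replication_commands(
        5, "backup.example.com", "/srv/share", "/srv/replica"
    ):
        print(shlex.join(cmd))
```

Each list could be handed to `subprocess.run(cmd, check=True)` at the end of the cron job, so that the rsync step is skipped if the remote copy fails.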
Is there a smarter way to replicate Samba shares as close to real-time as possible?
Anything out there that does "Remote Differential Compression" on Linux?