File Sync Solution for Batch Processing (ETL)
- by KenFar
I'm looking for a slightly different kind of sync utility - not one designed to keep two directories identical, but rather one intended to keep files flowing from one host to another.
The context is a data warehouse with a custom-developed solution that currently moves 10,000 files a day between Linux servers via ssh, some of them 1+ GB gzipped files. Files are produced by the extract process, then moved to the transform server, where a transform daemon is waiting to pick them up. The same handoff happens between transform & load. Once the files are moved, they are typically archived on the source for a week, and the downstream process likewise moves them to temp and then to archive as it consumes them (I've sketched both sides of this handoff after the list below). So, my requirements & desires:
- It is never used to refresh updated files - only to deliver new ones.
- Because it's delivering files to downstream processes, it needs to rename the file once the transfer is complete, so that a partial file never gets picked up (see the first sketch below).
- To simplify recovery, it should keep a copy of the source files - but rename them or move them to another directory.
- If a transfer fails (network down, file system full, permissions, file locked, etc.), it should retry periodically - and never fail in a non-recoverable way, never send the file twice, and never silently drop it.
- It should be able to copy files to 2+ destinations.
- It should have a consolidated log, so that problems are easy to find.
- It should have an optional checksum feature.
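
To make those semantics concrete, here is a rough Python sketch of the protocol I have in mind - the hostnames, paths, and retry interval are all made-up placeholders, and it leans on plain scp/ssh rather than anything clever. It's a sketch of the requirements, not a proposed implementation:

```python
#!/usr/bin/env python3
"""Sketch of the mover's send side: copy under a temp name, verify,
rename into place, retry forever, archive the source afterwards.
All hostnames, paths, and intervals here are made-up placeholders."""
import hashlib
import shutil
import subprocess
import time
from pathlib import Path

DESTS = ["transform1:/data/incoming", "transform2:/data/incoming"]  # 2+ destinations
SENT_DIR = Path("/data/extract/sent")   # source copies kept here for recovery
RETRY_SECONDS = 60

def sha256(path: Path) -> str:
    """Local checksum, chunked so 1+ GB files don't blow out memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def send_once(path: Path, dest: str) -> None:
    host, remote_dir = dest.split(":", 1)
    part = f"{remote_dir}/{path.name}.part"
    final = f"{remote_dir}/{path.name}"
    # Copy under a temp name so the downstream daemon never sees a partial file.
    subprocess.run(["scp", "-q", str(path), f"{host}:{part}"], check=True)
    # Optional checksum: compare digests before the file becomes visible.
    out = subprocess.run(["ssh", host, "sha256sum", part],
                         capture_output=True, text=True, check=True)
    if out.stdout.split()[0] != sha256(path):
        raise IOError(f"checksum mismatch: {path.name} -> {host}")
    # rename(2) within one remote filesystem is atomic: whole file or nothing.
    subprocess.run(["ssh", host, "mv", part, final], check=True)

def deliver(path: Path) -> None:
    for dest in DESTS:
        while True:   # retry transient failures; only the .part copy is redone
            try:
                send_once(path, dest)
                break
            except (subprocess.CalledProcessError, IOError) as exc:
                print(f"WARN {path.name} -> {dest}: {exc}; retrying")  # -> consolidated log
                time.sleep(RETRY_SECONDS)
    # Archive the source only after every destination has the file.
    shutil.move(str(path), SENT_DIR / path.name)
```

Even this toy version exposes the hard part: if the connection dies after the remote mv succeeds but before its exit status comes back, a naive retry delivers the file twice - true exactly-once needs some kind of ledger, which is exactly why I'd rather adopt a proven tool than keep extending our custom one.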
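
And for completeness, the receiving side as described above - since partial files only ever exist under a temp name, the daemon can treat anything matching the final pattern as complete (again, the paths and the *.gz pattern are purely illustrative):

```python
#!/usr/bin/env python3
"""Sketch of the downstream daemon's pickup loop. Paths and the file
pattern are illustrative only."""
import shutil
import time
from pathlib import Path

INCOMING = Path("/data/transform/incoming")
TEMP = Path("/data/transform/temp")
ARCHIVE = Path("/data/transform/archive")   # consumed files kept ~a week

def pickup_loop(process) -> None:
    while True:
        # Partial transfers only ever exist under *.part names, so any
        # *.gz sitting in INCOMING is guaranteed complete.
        for f in sorted(INCOMING.glob("*.gz")):
            work = TEMP / f.name
            f.rename(work)                  # claim it (atomic on one filesystem)
            process(work)
            shutil.move(str(work), ARCHIVE / work.name)
        time.sleep(5)
```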
Any recommendations? Can Unison do this well?