File Sync Solution for Batch Processing (ETL)
Posted by KenFar on Super User
Published on 2012-06-07T04:45:48Z
I'm looking for a slightly different kind of sync utility - not one designed to keep two directories identical, but rather one intended to keep files flowing from one host to another.
The context is a data warehouse with a custom-developed solution that moves 10,000 files a day, some of them 1+ GB gzipped files, between Linux servers via ssh. Files are produced by the extract process, then moved to the transform server, where a transform daemon is waiting to pick them up. The same handoff happens between transform and load. Once the files are moved, they are typically archived on the source for a week, and the downstream process likewise moves them to a temp directory and then to an archive as it consumes them.

So, my requirements and desires:
- It is never used to refresh updated files - only used to deliver new files.
- Because it's delivering files to downstream processes, it needs to rename the file once the transfer is done so that a partial file doesn't get picked up.
- In order to simplify recovery, it should keep a copy of the source files - but rename them or move them to another directory.
- If the transfer fails (network down, file system full, permissions, file locked, etc.), it should retry periodically - and it should never fail in a non-recoverable way, send the file twice, or never send the file at all.
- Should be able to copy files to 2+ destinations.
- Should have a consolidated log so that it's easy to find problems.
- Should have an optional checksum feature.
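For reference, here is a minimal sketch of the delivery pattern the requirements above describe - copy to a temp name, verify a checksum, rename atomically, and keep an archived copy of the source for recovery. This is illustrative Python, not our actual solution; all names are made up:

```python
import hashlib
import shutil
import time
from pathlib import Path

def sha256(path):
    """Checksum a file in chunks so large gzipped files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def deliver(src, dest_dirs, archive_dir, retries=3, delay=1.0):
    """Deliver src to each destination, then archive the source.

    Writes to a '.part' temp name first, verifies the checksum, and only
    then renames - so a downstream daemon never picks up a partial file.
    Failed transfers are retried; the source is archived (not deleted)
    to simplify recovery.
    """
    src = Path(src)
    want = sha256(src)
    for dest in map(Path, dest_dirs):          # supports 2+ destinations
        tmp = dest / (src.name + ".part")
        final = dest / src.name
        for attempt in range(retries):
            try:
                shutil.copy2(src, tmp)
                if sha256(tmp) != want:
                    raise IOError("checksum mismatch")
                tmp.rename(final)              # atomic on the same filesystem
                break
            except OSError:
                tmp.unlink(missing_ok=True)    # clean up the partial copy
                if attempt == retries - 1:
                    raise                      # recoverable: source still intact
                time.sleep(delay)
    # keep a copy of the source for recovery instead of deleting it
    shutil.move(str(src), str(Path(archive_dir) / src.name))
```

A real version would replace `shutil.copy2` with an ssh/rsync transfer and log each step to a consolidated log, but the temp-name-then-rename handoff is the part that keeps partial files invisible to consumers.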
Any recommendations? Can Unison do this well?