Reliable file copy (move) process - mostly Unix/Linux
- by mfinni
Short story : We have a need for a rock-solid reliable file mover process. We have source directories that are often being written to that we need to move files from. The files come in pairs - a big binary, and a small XML index. We get a CTL file that defines these file bundles. There is a process that operates on the files once they are in the destination directory; that gets rid of them when it's done. Would rsync do the best job, or do we need to get more complex? Long story as follows :
We have multiple sources to pull from : one set of directories are on a Windows machine (that does have Cygwin and an SSH daemon), and a whole pile of directories are on a set of SFTP servers (Most of these are also Windows.) Our destinations are a list of directories on AIX servers.
We used to use a very reliable Perl script on the Windows/Cygwin machine when it was our only source. However, we're working on getting rid of that machine, and there are other sources now, the SFTP servers, that we cannot presently run our own scripts on.
For security reasons, we can't run the copy jobs on our AIX servers - they have no access to the source servers. We currently have a homegrown Java program on a Linux machine that uses SFTP to pull from the various new SFTP source directories, copies to a local tmp directory, verifies that everything is present, then copies that to the AIX machines, and then deletes the files from the source. However, we're finding any number of bugs or poorly-handled error checking. None of us are Java experts, so fixing/improving this may be difficult.
Concerns for us are:
With a remote source (SFTP), will rsync leave alone any file still being written? Some of these files are large.
From reading the docs, it seems like rysnc will be very good about not removing the source until the destination is reliably written. Does anyone have experience confirming or disproving this?
Additional info We will be concerned about the ingestion process that operates on the files once they are in the destination directory. We don't want it operating on files while we are in the process of copying them; it waits until the small XML index file is present. Our current copy job are supposed to copy the XML file last.
Sometimes the network has problems, sometimes the SFTP source servers crap out on us. Sometimes we typo the config files and a destination directory doesn't exist. We never want to lose a file due to this sort of error.
We need good logs
If you were presented with this, would you just script up some rsync? Or would you build or buy a tool, and if so, what would it be (or what technologies would it use?) I (and others on my team) are decent with Perl.