I have a CSV file that needs to be edited by multiple processes at the same time. My question is, how can I do this without introducing race conditions?
It's easy to write to the end of the file without race conditions by open(2)ing it in "a" (O_APPEND) mode and simply writing to it. Things get more difficult when removing lines from the file.
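To make the append case concrete, here is a minimal sketch in Python (the real program need not be Python; `append_line` is just a name I made up for illustration). It assumes each line is small enough to go out in a single write(2) call, since O_APPEND only positions each individual write at the end of the file.

```python
import os

def append_line(path, line):
    """Append one line using a single write() on an O_APPEND descriptor."""
    data = (line.rstrip("\n") + "\n").encode("utf-8")
    # O_APPEND makes the kernel position every write at the current end of file.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        written = os.write(fd, data)
        if written != len(data):
            # A short write would leave a partial line in the file.
            raise OSError("short write; the line may be torn")
    finally:
        os.close(fd)
```

Calling `append_line("/home/user/somefile", "a,b,c")` from several processes should leave one complete line per call, as long as the assumption about single writes holds.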
The easiest solution is to read the file into memory, make changes to it, and write the result back over the file. If another process writes to the file after it has been read into memory, however, that new data will be lost when the file is overwritten. To further complicate matters:
- my platform does not support POSIX record locks,
- checking for file existence is a race condition waiting to happen,
- rename(2) replaces the destination file if it exists instead of failing, and
- editing a file in place leaves stale bytes in it unless the remaining bytes are shifted towards the beginning of the file.
My idea for removing a line is this (in pseudocode):
filename = "/home/user/somefile";
file = open(filename, "r");
tmp = open(filename+".tmp", "ax") || die("could not create tmp file"); //"a" is O_APPEND, "x" is O_EXCL|O_CREAT
while(write(tmp, read(file))); //copy the file to filename+".tmp"
close(file);
//edit tmp file
unlink(filename) || die("could not unlink file");
file = open(filename, "wx") || die("another process must have written to the file after we copied it."); //"w" is overwrite, "x" is O_EXCL|O_CREAT (fail if the file already exists)
while(write(file, read(tmp))); //copy ".tmp" back to the original file
unlink(filename+".tmp") || die("could not unlink tmp file");
Or would I be better off with a simple lock file?
Appender process:
lock = open(filename+".lock", "wx") || die("could not lock file");
file = open(filename, "a");
write(file, "stuff");
close(file);
close(lock);
unlink(filename+".lock");
Editor process:
lock = open(filename+".lock", "wx") || die("could not lock file");
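file = open(filename, "r+"); //"r+" is O_RDWR
while(contents += read(file));
//edit "contents"
seek(file, 0); //rewind, otherwise the write lands at the end of the file
write(file, contents);
truncate(file, length(contents)); //drop leftover bytes if the file shrank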
close(file);
close(lock);
unlink(filename+".lock");
Both of these rely on an additional file that will be left over if a process terminates before unlinking it, causing other processes to refuse to write to the original file.
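For reference, this is roughly how I picture the lock-file variant in Python. It is only a sketch: `lock_file` and `remove_line` are hypothetical names, and writing the PID into the lock is my own addition for debugging. The `finally` block removes the lock on ordinary exceptions, but it does not run if the process is killed outright, which is exactly the leftover-file problem described above.

```python
import contextlib
import os

@contextlib.contextmanager
def lock_file(path):
    """Hold path + ".lock" for the duration of the with-block."""
    lock_path = path + ".lock"
    # O_CREAT | O_EXCL raises FileExistsError if another process holds the lock.
    fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)

def remove_line(path, unwanted):
    """Rewrite the file in place without the lines equal to `unwanted`."""
    with lock_file(path):
        with open(path, "r+", newline="") as f:
            kept = [ln for ln in f if ln.rstrip("\r\n") != unwanted]
            f.seek(0)           # rewind before writing the edited contents
            f.writelines(kept)
            f.truncate()        # drop the leftover bytes past the new end
```

This only protects the file if every writer, including the appenders, takes the same lock before touching it.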
In my opinion, these problems are brought on by the fact that the OS allows multiple writable file descriptors to be opened on the same file at the same time, instead of failing if a writable file descriptor is already open. It seems that O_CREAT|O_EXCL is the closest thing to a real solution for preventing filesystem race conditions, aside from POSIX record locks.
Another possible solution is to separate the file into multiple files and directories, so that more granular control can be gained over components (lines, fields) of the file using O_CREAT|O_EXCL. For example, "file/$id/$field" would contain the value of column $field of the line $id. It wouldn't be a CSV file anymore, but it might just work.
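A rough Python sketch of that layout, to show what I have in mind (`create_field` and `read_row` are hypothetical helpers, and `root` stands in for the "file" directory). One caveat I can already see: a reader could observe a field file after it is created but before its value is written.

```python
import os

def create_field(root, row_id, field, value):
    """Create root/row_id/field; fail if that field already exists."""
    row_dir = os.path.join(root, str(row_id))
    os.makedirs(row_dir, exist_ok=True)
    # O_CREAT | O_EXCL raises FileExistsError if another process got there first.
    fd = os.open(os.path.join(row_dir, field),
                 os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        os.write(fd, value.encode("utf-8"))
    finally:
        os.close(fd)

def read_row(root, row_id):
    """Collect every field file of one row into a dict."""
    row_dir = os.path.join(root, str(row_id))
    row = {}
    for field in sorted(os.listdir(row_dir)):
        with open(os.path.join(row_dir, field), encoding="utf-8") as f:
            row[field] = f.read()
    return row
```

With this, removing a row would mean unlinking its field files and then its directory, rather than rewriting one shared file.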
Yes, I know I should be using a database for this as databases are built to handle these types of problems, but the program is relatively simple and I was hoping to avoid the overhead.
So, would any of these patterns work? Is there a better way? Any insight into these kinds of problems would be appreciated.