Cross-platform distributed fault-tolerant (disconnected operation/local cache) filesystem
- by Adrian Frühwirth
We are facing a design "challenge" where we are required to set up a storage solution with the following properties:
What we need
high availability (HA)
a scalable storage backend
offline/disconnected operation on the client to account for network outages
cross-platform access
client-side access, certainly from Windows (probably XP upwards) and possibly from Linux
a backend that integrates with AD/LDAP for permission management (user/group management, ...)
should work reasonably well over slow WAN links
Another problem is that we don't really know all possible use cases yet, e.g. whether people need concurrent access to shared files or will only ever access their own files, so a possible solution needs to account for concurrent access and for what conflict management would look like from a user's point of view.
This two-year-old blog post sums up the impression I have been getting during the last couple of days of research: there are lots of current übercool projects implementing (non-Windows) clustered, petabyte-capable blob-storage solutions, but none that supports disconnected operation nicely and natively. I am hoping that we have missed an obvious solution.
What we have tried
OpenAFS
We figured that we want a distributed network filesystem with a local cache and tested OpenAFS for a week; as the only currently "stable" DFS supporting disconnected operation, it seemed the way to go (a sketch of the client-side configuration follows the list below). However, there are several problems with it:
it's a real pain to set up
there are no official RHEL/CentOS packages
the package of the current stable version 1.6.5.1 from elrepo randomly kernel panics on fresh installs; this is an absolute no-go
Windows support (including the required Kerberos packages) is mystical. The current client for the 1.6 branch does not run on Windows 8; the current client for the 1.7 branch does, but it just randomly crashes. After that experience we didn't even bother testing on XP and Windows 7. Suffice it to say, we couldn't get it working, and the whole setup has been so unstable and complicated to set up that it's just not an option for production.
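For reference, the client-side piece we are after is not much more than the following. The cell name, cache size and sysconfig location are made up/distribution-specific, and the fs discon commands are as documented for the 1.6 branch, so treat this as a sketch rather than a recipe:

    # /usr/vice/etc/ThisCell -- the cell this client belongs to (made-up name)
    storage.example.com

    # /usr/vice/etc/cacheinfo -- AFS mount point : local cache directory : cache size in 1K blocks
    /afs:/var/cache/openafs:2000000

    # afsd startup options, e.g. /etc/sysconfig/openafs on RHEL-style packages
    AFSD_ARGS="-dynroot -fakestat -afsdb"

    # Disconnected operation is toggled on the cache manager by hand:
    #   fs discon offline    # work against the local cache only
    #   fs discon online     # reconnect and replay the changes made while offline

In theory that is all the "local cache" machinery we would need; in practice the stability problems above made it moot.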
Samba + Unison
Since OpenAFS was a complete disaster and no other DFS seems to support disconnected operation, we went for a simpler idea: syncing files against a Samba server using Unison. This approach has some advantages, but also serious drawbacks:
Samba integrates with AD; it's a pain, but it can be done.
Samba solves the problem of remotely accessing the storage from Windows, but it introduces another SPOF and does not address the actual storage problem. We could probably stick any clustered FS underneath Samba, but that means we need an HA Samba setup on top of it to maintain HA, which probably adds a lot of additional complexity. I vaguely remember trying to implement redundancy with Samba before and not being able to fail over between servers transparently.
Even when online, you are working with local files, which will result in more conflicts than would be necessary if a local cache were only touched when disconnected.
It's not automatic. We cannot expect users to manually sync their files using the (functional, but not-so-pretty) GTK GUI on a regular basis. I attempted to semi-automate the process using the Windows Task Scheduler, but you cannot really do it in a satisfactory way (see the sketch at the end of this section).
On top of that, the way Unison works makes syncing against Samba a costly operation: it has to scan both replicas on every run, and when one replica is an SMB mount that means stat-ing every file over the network. I am afraid that it just doesn't scale very well, or even at all.
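To illustrate the semi-automated variant: a minimal Unison profile that syncs a local working copy against the share (mounted as a network drive), driven by a scheduled task in batch mode. Profile name, paths, drive letter and schedule are made up, and prefer = newer is just one possible (and fairly blunt) conflict policy:

    # %USERPROFILE%\.unison\work.prf
    # local working copy
    root = C:/Users/alice/work
    # the Samba share, mounted as drive S:
    root = S:/alice/work
    # accept non-conflicting changes without prompting; real conflicts are skipped
    auto = true
    batch = true
    # compare modtimes/sizes instead of re-hashing everything over the wire
    fastcheck = true
    # crude conflict "resolution": the newer copy wins
    prefer = newer
    ignore = Name *.tmp

    REM run the profile above once an hour via the Task Scheduler (cmd)
    schtasks /Create /SC HOURLY /TN UnisonSync /TR "C:\Tools\unison.exe work -batch"

Even with this in place, every run still walks the entire replica on the share, which is where the cost mentioned above comes from.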
Samba + "Offline Files"
After that we became a little desperate and gave Windows "Offline Files" a chance. We figured that something built into the OS would reduce administrative effort, make it easier to blame someone else when it's not working properly, and just work, since people have been using it for years. Right? Wrong. We really wanted it to work, but it just doesn't. 30 minutes of copying files around and unplugging network cables/disabling network interfaces left us with:
undeletable files on the server (!), reported only by a tiny notification in the Windows Explorer status bar which doesn't even open Sync Center if you click on it, and
conflicts that should not even be conflicts.
In the end, we had one successful sync of a tiny text file; everything else just exploded horribly.
Beyond that, there are other problems:
Microsoft admits that "Offline Files" in Windows XP cannot cope with "large files" and therefore does not cache/sync them at all, which would mean those files become unavailable if the connection drops.
In Windows 7 the feature is only available in the Professional/Ultimate/Enterprise editions.
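For completeness, the server side of the "Offline Files" experiment is just a plain Samba share; the only related knob on the Samba side that we are aware of is the per-share client-side caching hint, which merely tells clients what to cache automatically (share name and path are made up):

    # smb.conf
    [userdata]
        path = /srv/userdata
        read only = no
        # hint to Windows clients: automatically cache user documents on this share
        csc policy = documents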
Summary
Unless there is another fault-tolerant DFS that supports Windows natively, I assume that stacking an HA Samba cluster on top of something like GlusterFS/Lustre/whatnot is the only option, but I hope that I am wrong here. How do other companies provide fault-tolerant network access to redundant storage in a heterogeneous environment with Windows?
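For concreteness, the stacked setup we are imagining (and hoping someone can talk us out of) would look roughly like this: GlusterFS or similar mounted identically on several nodes, CTDB providing floating IPs and clustered TDBs, and Samba exporting the shared mount. Volume paths, addresses and names below are made up:

    # smb.conf on every node: hand the TDB databases over to CTDB
    [global]
        clustering = yes
        security = ads
        workgroup = EXAMPLE
        realm = EXAMPLE.COM

    [shared]
        # the clustered filesystem, mounted at the same path on all nodes
        path = /mnt/gluster/shared
        read only = no

    # /etc/ctdb/nodes -- private addresses of all cluster nodes
    10.0.0.1
    10.0.0.2

    # /etc/ctdb/public_addresses -- floating IPs that clients actually connect to
    192.168.1.100/24 eth0
    192.168.1.101/24 eth0

    # /etc/sysconfig/ctdb -- the recovery lock must live on the shared filesystem
    CTDB_RECOVERY_LOCK=/mnt/gluster/.ctdb/lock
    CTDB_MANAGES_SAMBA=yes

That gives (more or less) transparent failover for the share itself, but it does nothing for the disconnected-operation requirement, which is the part we are stuck on.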