Dear Server Fault community,
After researching a number of distributed file systems for deployment in a production environment, with the main purpose of performing both batch and real-time distributed computing, I've identified the following candidates, selected mainly on maturity, licensing and support:
Ceph
Lustre
GlusterFS
HDFS
FhGFS
MooseFS
XtreemFS
The key properties our system should exhibit are:
an open-source, liberally licensed, yet production-ready solution, i.e. one that is mature, reliable, and supported by both a community and commercial parties;
the ability to run on commodity hardware, preferably being designed for it;
high availability of the data, with the main focus on reads;
high scalability, i.e. operation across multiple data centres, possibly on a global scale;
removal of single points of failure through replication and distribution of (meta)data, i.e. fault tolerance.
The sensitive points I identified, which resulted in the following questions, are:
transparency towards the processing layer / application with respect to data locality, i.e. knowing where data is physically located at the server level, mainly for resource allocation and fast, high-performance processing: how can this be accomplished? Do you know from experience which solutions provide this transparency, and to what extent? (A sketch of the kind of interface I mean follows after this list.)
POSIX compliance, or conformance, is mentioned on the wiki pages of most of the solutions listed above. The main question here is: how relevant is support for the POSIX standard? Hadoop's HDFS, for example, is not POSIX compliant by design; what are the pros and cons? (A small example of what gets lost is also shown below.)
what about the difference between synchronous and asynchronous operation of a distributed file system? Although a synchronous distributed file system is preferred for reliability, it also imposes certain limitations with respect to scalability. What would be, from your expertise, the way to go on this?
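To make the data-locality point more concrete: the kind of transparency I have in mind is what HDFS exposes through its client API, where a scheduler can ask the NameNode which hosts hold each block of a file and place computation accordingly. A minimal sketch using the Hadoop FileSystem API (the path and cluster configuration below are just placeholders, not from a real setup):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in the classpath configuration points at the cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path, purely for illustration.
        Path file = new Path("/data/input/part-00000");
        FileStatus status = fs.getFileStatus(file);

        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```

Something comparable, where the application layer can discover physical placement and exploit it for scheduling, is what I'm hoping the other candidates can offer as well.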
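And to illustrate the POSIX point: what typically gets lost in non-POSIX designs are operations such as overwriting bytes in the middle of an existing file. A rough sketch of the contrast I mean, again with placeholder paths (HDFS only offers create and append, not in-place modification):

```java
import java.io.RandomAccessFile;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PosixVsHdfs {
    public static void main(String[] args) throws Exception {
        // On a POSIX file system, seeking into an existing file and
        // overwriting bytes in place is an ordinary operation.
        try (RandomAccessFile raf = new RandomAccessFile("/tmp/local.dat", "rw")) {
            raf.seek(1024);
            raf.write("patched".getBytes());
        }

        // HDFS, by contrast, is write-once/append-only: there is no API for
        // overwriting in the middle of a file, only create() and append().
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out = fs.append(new Path("/data/log.dat"))) {
            out.write("new record\n".getBytes());
        }
        fs.close();
    }
}
```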
I'm looking forward to your replies. Thanks in advance! :)
With kind regards,
Tim van Elteren