OrangeFS
Linux kernel version 4.6 has just been released, and one of the items mentioned is kernel drivers for the distributed filesystem OrangeFS.
OrangeFS certainly does look interesting. However, they claim that “using OrangeFS instead of HDFS … can improve MapReduce performance and …”. Having looked at the OrangeFS docs, I think this is somewhat overstated.
OrangeFS is similar in some ways to HDFS; it runs on a cluster of servers which provide a set of metadata and datanode services that “manage” the data, but the data itself is actually stored in a native filesystem, eg ext4.
HDFS divides file content into “blocks”, and stores each block individually. This allows a single file to span the entire available storage capacity if desired; a file is not limited to the size of a single disk or the total size of storage on a single node. The design overview in the OrangeFS documentation appears to show OrangeFS working in a similar way; as far as I can tell, PVFS (the project from which OrangeFS is derived) uses the term “striping” for this feature.
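To make the idea of splitting a file across servers concrete, here is a minimal sketch of simple round-robin striping. It is a toy model only: the function names, the stripe size and the server names are all made up for illustration, and neither HDFS nor OrangeFS is claimed to place data exactly this way (HDFS in particular lets its metadata server choose where each block goes).

```python
def locate(offset, stripe_size, servers):
    """Map a byte offset in a file to (server, stripe_index, offset_in_stripe).

    Assumes plain round-robin placement: stripe i lives on server i % N.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    stripe_index = offset // stripe_size
    server = servers[stripe_index % len(servers)]
    return server, stripe_index, offset % stripe_size


def bytes_per_server(file_size, stripe_size, servers):
    """Return how many bytes of a file of the given size each server stores."""
    totals = {s: 0 for s in servers}
    full_stripes, remainder = divmod(file_size, stripe_size)
    for i in range(full_stripes):
        totals[servers[i % len(servers)]] += stripe_size
    if remainder:
        # The final, partial stripe goes to the next server in the rotation.
        totals[servers[full_stripes % len(servers)]] += remainder
    return totals


if __name__ == "__main__":
    servers = ["node-a", "node-b", "node-c"]  # hypothetical server names
    print(locate(200, 64, servers))
    print(bytes_per_server(200, 64, servers))
```

The point of the sketch is just that no single server ever needs to hold the whole file, so the maximum file size is bounded by the cluster and not by one disk.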
HDFS allows a cluster to be expanded (or shrunk) dynamically without downtime. According to the same design overview, this capability is “in planning for version 3” of OrangeFS. At the moment, adding nodes to your OrangeFS cluster means downtime.
HDFS provides high availability by replicating data across multiple machines. According to the OrangeFS Wikipedia page, replication was added to OrangeFS “for immutable files” in version 2.8.5, although interestingly I can find no configuration items related to this feature in the OrangeFS documentation. Googling for “replication” on the user lists turns up only a single thread from 2012. It isn’t clear whether “immutable files” includes append-only files (as with HDFS). I suppose the “native filesystem” could potentially be mounted on distributed block storage which does replication (eg DRBD), but that adds an extra layer of indirection that HDFS doesn’t need - and an extra layer of administration to perform. Using RAID or similar does provide some robustness, but not nearly as much as true data replication.
HDFS is a location-aware filesystem, where a client can query the FS to find out which physical nodes are hosting copies (replication, remember!) of different parts of the file. A cluster management system such as YARN can then execute code to process sections of a file on or near the node that hosts the data itself, significantly reducing network IO. I can see no reference to such a feature in OrangeFS. In addition, the effectiveness of this strategy is greatly enhanced with a filesystem that performs replication, ie where the same chunk of data is stored on multiple hosts, giving the cluster management system a choice of places in which to execute the processing logic.
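The effect of replication on data locality can be illustrated with a small sketch. This is not YARN's actual scheduling algorithm; it is a deliberately naive greedy placement, and the function name, block names and node names are all hypothetical. Each block gets one task, and the task is placed on a node holding a copy of the block whenever such a node has a free slot.

```python
def assign_tasks(block_locations, free_slots):
    """Greedily assign one task per block, preferring nodes that hold the block.

    block_locations: dict mapping block id -> list of nodes holding a copy
    free_slots:      dict mapping node -> number of free task slots
    Returns a dict mapping block id -> (node, is_local).
    """
    slots = dict(free_slots)  # work on a copy; leave the caller's dict alone
    assignment = {}
    for block, hosts in block_locations.items():
        local = [h for h in hosts if slots.get(h, 0) > 0]
        if local:
            # Prefer the data-holding node with the most free capacity.
            node, is_local = max(local, key=lambda h: slots[h]), True
        else:
            # No node with a copy is free: fall back to reading over the network.
            candidates = [h for h, n in slots.items() if n > 0]
            if not candidates:
                raise RuntimeError("no free slots left in the cluster")
            node, is_local = max(candidates, key=lambda h: slots[h]), False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


if __name__ == "__main__":
    slots = {"n1": 1, "n2": 1}
    unreplicated = {"b0": ["n1"], "b1": ["n1"]}
    replicated = {"b0": ["n1", "n2"], "b1": ["n1", "n2"]}
    print(assign_tasks(unreplicated, slots))
    print(assign_tasks(replicated, slots))
```

With a single copy of each block, the second task is forced onto a node that must fetch its data remotely; with two copies, both tasks can run next to their data. That is the sense in which replication and location-awareness reinforce each other.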
In summary, OrangeFS seems to be an interesting filesystem. It has a long history behind it, and if it’s fast enough for BlueGene supercomputers, it’s probably fast enough for the rest of us. But I don’t see it as a replacement for HDFS - at least not yet.
Sadly, I couldn’t find any trace of release notes for the various OrangeFS releases. The Wikipedia page seems to be the best resource for that(!), although some basic info is also available on the PVFS site.
The OrangeFS FAQ has some interesting reading, particularly section 7 (“fault tolerance”), and there is an interesting article on LWN about the design and future of OrangeFS.