HDFS datanode startup fails when disks are full
- by mbac
Our HDFS cluster is only 90% full overall, but some datanodes have individual disks that are 100% full. As a result, when we mass-reboot the entire cluster, some datanodes fail to start at all, with a message like this:
2013-10-26 03:58:27,295 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Mkdirs failed to create /mnt/local/sda1/hadoop/dfsdata/blocksBeingWritten
Only three datanodes have to fail this way before we start experiencing real data loss.
Currently we work around it by decreasing the amount of disk space reserved for the root user, but we'll eventually run out of headroom there. We also run the balancer pretty much constantly, yet some disks stay stuck at 100% anyway.
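For reference, this is roughly what that workaround looks like today; I'm assuming ext3/ext4 data disks here, and /dev/sda1 and the threshold value are only examples:

    # Shrink the ext reserved-blocks percentage (space held back for root) to 1%
    tune2fs -m 1 /dev/sda1

    # Keep the balancer running so blocks get spread across datanodes;
    # the threshold is the allowed deviation from average utilization, in percent
    hadoop balancer -threshold 5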
Changing the dfs.datanode.failed.volumes.tolerated setting is not the solution, since the volume has not failed; it's merely full.
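For completeness, that setting lives in hdfs-site.xml and looks like this (the value here is only illustrative):

    <property>
      <!-- Number of volumes allowed to fail before the datanode refuses to serve;
           in our case the disk hasn't failed, it's just full, so this doesn't help -->
      <name>dfs.datanode.failed.volumes.tolerated</name>
      <value>1</value>
    </property>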
Any ideas?