What Is Safe Mode in HDFS?

HDFS is a distributed file system: blocks of each file are replicated across the datanodes of the cluster according to the file's replication factor. The default replication factor in HDFS is 3 (configurable via hdfs-site.xml), so every block is written to 3 different nodes. Among other things, this provides a backup capability: if up to two of those nodes are down, the block can still be served and even re-replicated to other nodes.
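
For example, you can inspect and change the replication factor of a single file with the standard hdfs dfs commands (the path below is only an illustration):

hdfs dfs -stat %r /user/ubuntu/example.csv
hdfs dfs -setrep -w 3 /user/ubuntu/example.csv

The first command prints the file's current replication factor; the second changes it, and the -w flag waits until the requested replication level is actually reached.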

What happens if all three nodes that hold a certain block are down? HDFS marks the block as unavailable and raises a warning, both in the logs and in its web UI (port 50070 on the master node, which is called the "namenode"). HDFS also has a safety mechanism: when the percentage of available blocks drops below a certain threshold, HDFS enters "safe mode". In this mode HDFS becomes a read-only file system, preventing any modification (creating, deleting or updating files). The threshold can be configured using the dfs.namenode.safemode.threshold-pct property in hdfs-site.xml and is set by default to 0.999.
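
For reference, the two properties mentioned above would look like this in hdfs-site.xml (the values shown are the defaults):

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.namenode.safemode.threshold-pct</name>
  <value>0.999</value>
</property>

You can also read the value in effect on a running cluster with hdfs getconf -confKey dfs.namenode.safemode.threshold-pct.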

Identifying Safe Mode

Safe mode can be identified in two ways: the UI and the command line. In the UI (http://<namenode-host>:50070), go to the "Overview" tab. In safe mode the following will appear:

[Image: HDFS web UI Overview tab showing that safe mode is on, with a description of the cause]

You can see that safe mode is ON, along with a description of the cause.

For the command line, run the following command on the namenode machine:

hdfs dfsadmin -safemode get
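
When the cluster is in safe mode, the output will look similar to:

Safe mode is ON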

Handling Safe Mode

As described above, HDFS in safe mode does not permit file modification, so it is crucial to handle the situation and bring the system back to a functioning state in which safe mode is off.
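
Note that safe mode can also be switched off manually from the namenode machine:

hdfs dfsadmin -safemode leave

However, this only lifts the write restriction; it does not recover the unavailable blocks, so the underlying problem still needs to be handled as described below.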

Dead datanodes

Usually the situation occurs due to datanodes that are down. You can view the status of the datanodes in the UI, either under the "Datanodes" tab or in the "Summary" section of the "Overview" tab, which states the number of live datanodes.
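
The same information is available from the command line on the namenode machine; the report includes the number of live and dead datanodes along with their storage usage:

hdfs dfsadmin -report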

After identifying which datanodes are not active, go to the machine of the offline datanode, turn it on, and start the datanode process by issuing the following command as the user with Hadoop permissions:

$HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode

After starting the datanode, go to the namenode machine and verify that HDFS has registered the change and turned safe mode off. You can check the state of the file system using the hdfs fsck command:

hdfs fsck /

When the system is back to its normal state, the last line of the command output will be:

The filesystem under path '/' is HEALTHY

Missing blocks

You may also encounter a situation where all datanodes are up and running, but several files failed to be written to the various nodes and are considered corrupted. In that case, starting the datanodes alone is not enough. The corrupted files must be identified and removed from HDFS. Only after they are removed will HDFS turn safe mode off.

In order to identify which files have missing blocks, you can use the following command:

hdfs fsck / | grep MISSING

The command will output all missing blocks and the files they belong to:

[Image: hdfs fsck output listing missing blocks and the files they belong to]

You can see the name and path of each file that has missing blocks. Use that path to remove the file:

hdfs dfs -rm /user/ubuntu/hadoop-hduser-datanode-ip-172-16-1-10.out

Repeat the process for all corrupted files until HDFS reports at the end of the fsck command output that The filesystem under path '/' is HEALTHY.
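
If many files are corrupted, it may be easier to let fsck list or delete them in one pass using its standard options; be careful with -delete, as it permanently removes the affected files:

hdfs fsck / -list-corruptfileblocks
hdfs fsck / -delete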

Summary

HDFS safe mode is a safety mechanism that prevents modifications when the HDFS cluster is unstable. It is usually triggered by unavailable blocks, caused either by dead datanodes or by file system corruption. Recovering dead datanodes and deleting corrupted files is crucial to bring HDFS back to a stable state.