Unable to download to any storage directory

We had an issue in which the secondary namenode was not checkpointing the edits from the primary namenode. In case you are unaware, the secondary namenode (in a non-HA configuration) periodically downloads the edits files from the primary namenode, replays them, and merges the updates into its copy of the fsimage file. In our case, we received an alert from Ambari that a checkpoint had not occurred for several hours. This article is simply a record of how we troubleshot the issue.
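For context, how often the secondary namenode checkpoints is governed by two hdfs-site.xml properties. The values below are the stock Hadoop 2.6 defaults; our cluster may have overridden them:

```xml
<!-- Checkpoint every hour, or sooner once 1,000,000 uncheckpointed
     transactions accumulate (Hadoop 2.6 defaults). -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
</property>
```

The Ambari alert fires when neither threshold has produced a successful checkpoint within its expected window.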

From the secondary namenode host, try to force a checkpoint. Note that this does not work: the command starts a second SecondaryNameNode instance, which tries to bind to the HTTP port already held by the running daemon (assuming one is running, as it was in our case).

[hdfs@cmhlddlkeedg02 ~]$ hadoop secondarynamenode -checkpoint force
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

15/08/12 13:33:59 INFO namenode.SecondaryNameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting SecondaryNameNode
STARTUP_MSG:   host = cmhlddlkeedg02.expdev.local/172.27.2.196
STARTUP_MSG:   args = [-checkpoint, force]
STARTUP_MSG:   version = 2.6.0.2.2.4.2-2



15/08/12 13:34:01 INFO http.HttpServer2: HttpServer.start() threw a non Bind IOException
java.net.BindException: Port in use: cmhlddlkeedg02.expdev.local:50090
        at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:891)
        at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:827)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:276)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:192)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:671)
Caused by: java.net.BindException: Address already in use
        at sun.nio.ch.Net.bind0(Native Method)
        at sun.nio.ch.Net.bind(Net.java:444)
        at sun.nio.ch.Net.bind(Net.java:436)
        at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
        at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
        at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
        at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:886)
        ... 4 more

We then see that the process holding the port open is the secondary namenode, as expected…

[hdfs@cmhlddlkeedg02 ~]$ netstat -anp | grep 50090
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 172.27.2.196:50090          0.0.0.0:*                   LISTEN      22115/java
tcp        0      0 172.27.2.196:24296          172.27.2.196:50090          TIME_WAIT   -
[hdfs@cmhlddlkeedg02 ~]$ ps -ef | grep 22115
hdfs     20009 19291  0 13:34 pts/1    00:00:00 grep 22115
hdfs     22115     1  0 Jul22 ?        00:22:48 /usr/jdk64/jdk1.7.0_67/bin/java -Dproc_secondarynamenode  org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode

We then locate the secondary namenode's log file via its open file descriptors, and find the following exception in it…

[hdfs@cmhlddlkeedg02 ~]$ ls -lrt /proc/22115/fd | grep log$
l-wx------. 1 hdfs hadoop 64 Aug 11 15:54 339 -> /var/log/hadoop/hdfs/hadoop-hdfs-secondarynamenode-cmhlddlkeedg02.expdev.local.log

2015-08-12 13:32:38,871 ERROR namenode.SecondaryNameNode (SecondaryNameNode.java:doWork(399)) - Exception in doCheckpoint
java.io.IOException: Unable to download to any storage directory
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.receiveFile(TransferFsImage.java:505)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.doGetUrl(TransferFsImage.java:431)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:395)
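The "Unable to download to any storage directory" message appears to come from TransferFsImage.receiveFile when it cannot open the received file for writing in any of the secondary namenode's checkpoint storage directories. A first hedged check, then, is disk space and permissions on those directories. The property involved is dfs.namenode.checkpoint.dir; the path below is only an assumed HDP-style default, so substitute your configured value:

```shell
# Hedged check: verify the checkpoint directory (dfs.namenode.checkpoint.dir)
# exists, is owned/writable by hdfs, and its filesystem is not full.
# /hadoop/hdfs/namesecondary is an assumed default, not taken from our cluster.
CKPT_DIR=${CKPT_DIR:-/hadoop/hdfs/namesecondary}
if [ -d "$CKPT_DIR" ]; then
  df -h "$CKPT_DIR"    # free space on the backing filesystem
  ls -ld "$CKPT_DIR"   # ownership and permissions
else
  echo "checkpoint dir not found: $CKPT_DIR"
fi
```

In our case these checks came back clean, which is what pushed us toward tracing the process instead.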

We next turn to our trusty friend strace to see what the secondary namenode is doing. We see it open a socket to the primary namenode and request an edits file over HTTP (edits_0000000000007192672-0000000000007202362 in the output below)…

22780 13:37:39.444699 connect(355, {sa_family=AF_INET, sin_port=htons(50070), sin_addr=inet_addr("172.27.2.195")}, 16) = -1 EINPROGRESS (Operation now in progress)
22780 13:37:39.445739 sendto(355, "GET /imagetransfer?getedit=1&startTxId=7192672&endTxId=7202362&storageInfo=-60:554790117:0:CID-18f28bd7-c752-4cde-9888-14b4c5c2a70d HTTP/1.1\r\nUser-Agent: Java/1.7.0_67\r\nHost: cmhlddlkeedg01.expdev.local:50070\r\nAccept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2\r\nConnection: keep-alive\r\n\r\n", 298, 0, NULL, 0) = 298
22780 13:37:39.447019 recvfrom(355, "HTTP/1.1 200 OK\r\nCache-Control: no-cache\r\nExpires: Wed, 12 Aug 2015 17:37:39 GMT\r\nDate: Wed, 12 Aug 2015 17:37:39 GMT\r\nPragma: no-cache\r\nExpires: Wed, 12 Aug 2015 17:37:39 GMT\r\nDate: Wed, 12 Aug 2015 17:37:39 GMT\r\nPragma: no-cache\r\nContent-Type: text/plain; charset=utf-8\r\nContent-Length: 1643554\r\nContent-Disposition: attachment; filename=edits_0000000000007192672-0000000000007202362\r\nX-Image-Edits-Name: edits_0000000000007192672-0000000000007202362\r\nServer: Jetty(6.1.26.hwx)\r\n\r\n\377\377\377\304\0\0\0\0\30\0\0\0\f\0\0\0\0\0m\300`\301\3.D\37\0\0\0"..., 8192, 0, NULL, NULL) = 8192
22780 13:37:39.448322 write(2, "java.io.IOException: Unable to download to any storage directory", 64) = 64
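For reference, a trace like the one above can be captured by attaching strace to the daemon with -f (follow threads, whose IDs prefix each line) and -tt (microsecond timestamps), e.g. `strace -f -tt -p 22115`; we did not record our exact flags, so treat these as an assumption. The sketch below applies the same flags to a short-lived command instead of attaching to a PID, so it is safe to run anywhere strace is installed:

```shell
# Same technique, demonstrated on a throwaway command rather than by
# attaching to the secondary namenode with -p <pid>.
command -v strace >/dev/null 2>&1 || { echo "strace not installed; skipping"; exit 0; }
strace -f -tt -o /tmp/snn_demo.strace true 2>/dev/null \
  || echo "strace could not attach (ptrace may be restricted)"
[ -s /tmp/snn_demo.strace ] && head -n 3 /tmp/snn_demo.strace || true
```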

We check for the existence of this file on the primary namenode, and find it is there. We also get its checksum…

[root@cmhlddlkeedg01 hadoop]# find . | grep edits_0000000000007192672-0000000000007202362 | xargs cksum
4274517808 1643554 ./hdfs/namenode/current/edits_0000000000007192672-0000000000007202362
[root@cmhlddlkeedg01 hadoop]#

We next try to fetch the edits file ourselves using wget (possible without authentication because this is a non-secure cluster). The download succeeds, and its checksum matches the one we computed on the primary namenode.

[hdfs@cmhlddlkeedg02 ~]$ wget "http://cmhlddlkeedg01:50070/imagetransfer?getedit=1&startTxId=7192672&endTxId=7202362&storageInfo=-60:554790117:0:CID-18f28bd7-c752-4cde-9888-14b4c5c2a70d"
--2015-08-12 13:44:08--  http://cmhlddlkeedg01:50070/imagetransfer?getedit=1&startTxId=7192672&endTxId=7202362&storageInfo=-60:554790117:0:CID-18f28bd7-c752-4cde-9888-14b4c5c2a70d
Resolving cmhlddlkeedg01... 172.27.2.195
Connecting to cmhlddlkeedg01|172.27.2.195|:50070... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1643554 (1.6M) [text/plain]
Saving to: “imagetransfer?getedit=1&startTxId=7192672&endTxId=7202362&storageInfo=-60:554790117:0:CID-18f28bd7-c752-4cde-9888-14b4c5c2a70d”

100%[=======================================================================================================================================>] 1,643,554   --.-K/s   in 0.02s

2015-08-12 13:44:08 (101 MB/s) - “imagetransfer?getedit=1&startTxId=7192672&endTxId=7202362&storageInfo=-60:554790117:0:CID-18f28bd7-c752-4cde-9888-14b4c5c2a70d” saved [1643554/1643554]

[hdfs@cmhlddlkeedg02 ~]$ cksum imagetransfer\?getedit\=1\&startTxId\=7192672\&endTxId\=7202362\&storageInfo\=-60\:554790117\:0\:CID-18f28bd7-c752-4cde-9888-14b4c5c2a70d
4274517808 1643554 imagetransfer?getedit=1&startTxId=7192672&endTxId=7202362&storageInfo=-60:554790117:0:CID-18f28bd7-c752-4cde-9888-14b4c5c2a70d
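For anyone unfamiliar with cksum: it prints a CRC and a byte count, and byte-identical files always produce the same pair, so matching output on both hosts rules out corruption in transit. A minimal local illustration:

```shell
# Byte-identical copies yield identical cksum output (CRC + size);
# any corruption in transfer would change the first field.
printf 'edits payload\n' > /tmp/edits_demo_a
cp /tmp/edits_demo_a /tmp/edits_demo_b
cksum /tmp/edits_demo_a /tmp/edits_demo_b
```

So at this point we know the file is intact on the primary namenode and transfers correctly over HTTP; the problem must be on the receiving side.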

We also checked for resource limits on the hdfs user, such as the maximum number of open files, but found no issues there.
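Concretely, those checks looked something like this (a sketch; the /proc path is Linux-specific, and on the real host you would substitute the daemon's PID for "self"):

```shell
# Open-file limit as seen by the hdfs user's shell.
ulimit -n
# Effective limits of a running process, read directly from procfs;
# replace 'self' with the daemon PID (22115 in our case) on the real host.
grep -i 'open files' /proc/self/limits
```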

We finally opted to restart the secondary namenode, which resolved the issue. What we should have done, but did not think of until later, was to take a thread dump of the secondary namenode before restarting it. My guess is we would have seen a BLOCKED thread somewhere, which would have warranted further investigation.
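For reference, that thread dump is a one-liner: send SIGQUIT to the JVM (kill -3, which is what jstack-style dumps also rely on), and it writes every thread's stack to its hadoop-hdfs-secondarynamenode-*.out file while continuing to run. The sketch below sends the same signal to a throwaway process so it is safe to run anywhere; on the real host the target would have been PID 22115:

```shell
# Hedged sketch: a plain process simply exits on SIGQUIT, whereas a JVM
# prints a full thread dump to stdout/.out and keeps running.
sleep 60 &                      # stand-in for the secondary namenode JVM
TARGET=$!
kill -QUIT "$TARGET"
wait "$TARGET" 2>/dev/null || true   # reap the stand-in; ignore its signal exit
echo "sent SIGQUIT to $TARGET"
```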
