{"id":2817,"date":"2013-05-08T13:13:18","date_gmt":"2013-05-08T18:13:18","guid":{"rendered":"http:\/\/appcrawler.com\/wordpress\/?p=2817"},"modified":"2013-05-08T14:53:39","modified_gmt":"2013-05-08T19:53:39","slug":"does-hadoophdfs-distribute-writes-to-all-data-nodes-on-ingest","status":"publish","type":"post","link":"http:\/\/appcrawler.com\/wordpress\/2013\/05\/08\/does-hadoophdfs-distribute-writes-to-all-data-nodes-on-ingest\/","title":{"rendered":"Does hadoop\/HDFS distribute writes to all data nodes on ingest?"},"content":{"rendered":"<p>I like simple, command line test cases.  Lather, rinse, repeat (do any shampoo bottles actually have that anymore \ud83d\ude42 ?)<\/p>\n<p>I wanted to prove that ingests into hadoop don&#8217;t actually send everything through the name node, which would be a bottleneck.<\/p>\n<p>I proved that it does, in fact, distribute writes across all data nodes with the following test case, using hadoop-0.23.7.<\/p>\n<p>From a client node (not a data node, or any part of the cluster) with a hadoop installation&#8230;<\/p>\n<p>Set up your CLASSPATH&#8230;<\/p>\n<pre lang=\"text\">\r\n#!\/bin\/sh\r\n\r\nexport CLASSPATH=.\r\n\r\nfor f in $HADOOP_HOME\/share\/hadoop\/common\/lib\/*; do export CLASSPATH=$CLASSPATH:${f}; done\r\nfor f in $HADOOP_HOME\/share\/hadoop\/common\/*; do export CLASSPATH=$CLASSPATH:${f}; done\r\nfor f in $HADOOP_HOME\/share\/hadoop\/hdfs\/*; do export CLASSPATH=$CLASSPATH:${f}; done\r\n\r\n<\/pre>\n<p>Compile the following Java class&#8230;<\/p>\n<pre lang=\"java\" line=\"1\">\r\nimport java.io.*;\r\nimport org.apache.hadoop.conf.*;\r\nimport org.apache.hadoop.fs.*;\r\n\r\npublic class HDFSWriteTest {\r\n  public static void main(String[] args) throws IOException {\r\n    Configuration conf = new Configuration();\r\n    FileSystem fs = FileSystem.get(conf);\r\n    System.out.println(fs.getUri());\r\n    Path file = new Path(args[0]);\r\n    \/\/ set the file block size to 1048576 (1MB) so that 
the writes are more likely to go to multiple nodes\r\n    FSDataOutputStream outStream = fs.create(file, false, 4096, (short)3, (long)1048576);\r\n    \/\/ stream the local source file into HDFS line by line\r\n    BufferedReader br = new BufferedReader(new FileReader(\"vals.txt\"));\r\n    String strLine;\r\n    while ((strLine = br.readLine()) != null) {\r\n      outStream.writeUTF(strLine);\r\n    }\r\n    br.close();\r\n    outStream.close();\r\n    fs.close();\r\n  }\r\n}\r\n<\/pre>\n<p>Generate 800,000 rows in the local source file that we will load into hadoop\/HDFS&#8230;<\/p>\n<pre lang=\"text\">\r\n# for i in {1..800000}; do echo \"value number ${i}\" >> vals.txt ; done &\r\n<\/pre>\n<p>Turn on tcpdump on the name node and at least one data node in the hadoop cluster&#8230;<\/p>\n<pre lang=\"text\">\r\n# tcpdump -A -s 1500 -nnn not port 22 > output.txt\r\n<\/pre>\n<p>Run the Java class&#8230;<\/p>\n<pre lang=\"text\">\r\n$HADOOP_HOME\/bin\/hadoop HDFSWriteTest foobar.txt\r\n<\/pre>\n<p>CTRL-C the tcpdump command on each host in the hadoop cluster where it is running, then analyze the output on each with the following.  Change 192.168.3.100 to the IP address of your client machine (the one from which you ran the test)&#8230;<\/p>\n<pre lang=\"text\">\r\nawk '{if ($3 ~ \"192.168.3.100\") {l=1} else if (l == 1 && $0 ~ \"value number\") {s++;l=0}} END {print s}' output.txt\r\n<\/pre>\n<p>You should see a non-zero count on each datanode on which you ran tcpdump: the write traffic went from the client machine directly to the datanodes, not through the name node.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I like simple, command line test cases. Lather, rinse, repeat (do any shampoo bottles actually have that anymore \ud83d\ude42 ?) 
I wanted to ensure I could prove that ingests to hadoop actually didn&#8217;t send everything through the name node, which&hellip;<\/p>\n<p class=\"more-link-p\"><a class=\"more-link\" href=\"http:\/\/appcrawler.com\/wordpress\/2013\/05\/08\/does-hadoophdfs-distribute-writes-to-all-data-nodes-on-ingest\/\">Read more &rarr;<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false,"footnotes":""},"categories":[19,21,25],"tags":[],"_links":{"self":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/2817"}],"collection":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/comments?post=2817"}],"version-history":[{"count":20,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/2817\/revisions"}],"predecessor-version":[{"id":2845,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/2817\/revisions\/2845"}],"wp:attachment":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/media?parent=2817"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/categories?post=2817"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/tags?post=2817"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}