Does hadoop/HDFS distribute writes to all data nodes on ingest?

I like simple, command line test cases. Lather, rinse, repeat (do any shampoo bottles actually have that anymore 🙂 ?)

I wanted to prove that ingests to hadoop don’t actually send all of the data through the name node, which would be a bottleneck.

I proved that it does, in fact, distribute writes over all nodes with the following test case, using hadoop-0.23.7.

From a client node (not a data node, or any part of the cluster) with a hadoop installation…

Set up your CLASSPATH (source this script rather than executing it, so the exports persist in your shell)…

#!/bin/sh

export CLASSPATH=.

for f in $HADOOP_HOME/share/hadoop/common/lib/*; do export CLASSPATH=$CLASSPATH:${f}; done
for f in $HADOOP_HOME/share/hadoop/common/*; do export CLASSPATH=$CLASSPATH:${f}; done
for f in $HADOOP_HOME/share/hadoop/hdfs/*; do export CLASSPATH=$CLASSPATH:${f}; done

Compile the following java class…

import java.io.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;

public class HDFSWriteTest {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.getUri());
    Path file = new Path(args[0]);
    // Set the file block size to 1048576 (1 MB) so the writes are more
    // likely to go to multiple data nodes.
    FSDataOutputStream outStream = fs.create(file, false, 4096, (short) 3, (long) 1048576);
    BufferedReader br = new BufferedReader(new FileReader("vals.txt"));
    String strLine;
    while ((strLine = br.readLine()) != null) {
      outStream.writeUTF(strLine);
    }
    br.close();
    outStream.close();
    fs.close();
  }
}
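One side note worth knowing before reading the packet capture: writeUTF prepends a two-byte length to each string (modified UTF-8) and does not append a newline, so the "value number" payloads appear back to back on the wire rather than one per line. A minimal stdlib-only sketch of that framing, no Hadoop required:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteUTFDemo {
  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buf);
    out.writeUTF("value number 1"); // 2-byte length prefix + 14 ASCII bytes
    out.writeUTF("value number 2");
    out.close();
    byte[] bytes = buf.toByteArray();
    // Each record is 0x00 0x0E followed by the 14 characters; no newline.
    System.out.println(bytes.length); // 32 = 2 * (2 + 14)
  }
}
```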

Generate 800,000 rows in the source file that we will load into hadoop/HDFS…

# for i in {1..800000}; do echo "value number ${i}" >> vals.txt ; done &
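The point of 800,000 rows is to make vals.txt far larger than the 1 MB block size set in the Java class, so the file is split across many blocks (and therefore many data nodes). A rough stdlib-only size estimate — my own arithmetic for the generated file, not a figure from the test run:

```java
public class ValsSizeEstimate {
  public static void main(String[] args) {
    long bytes = 0;
    for (int i = 1; i <= 800000; i++) {
      // "value number " (13 chars) + digits of i + newline
      bytes += 13 + String.valueOf(i).length() + 1;
    }
    System.out.println(bytes);              // 15888895, roughly 15 MB
    System.out.println(bytes / 1048576 + 1); // about 16 blocks at a 1 MB block size
  }
}
```

The HDFS copy ends up slightly larger, since writeUTF stores a 2-byte length prefix per row instead of the newline, but the order of magnitude is the same.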

Turn on tcpdump on the name node and at least one data node in the hadoop cluster…

# tcpdump -A -s 1500 -nnn not port 22 > output.txt

Run the java class…

$HADOOP_HOME/bin/hadoop HDFSWriteTest foobar.txt

CTRL-C the tcpdump commands on each host in the hadoop cluster where they are running, then analyze the output on each with the following. Change 192.168.3.100 to the IP address of your client machine (the one from which you ran the test)…

awk '{if ($3 ~ "192.168.3.100") {l=1} else if (l == 1 && $0 ~ "value number") {s++;l=0}} END {print s}' output.txt
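For readers who don’t speak awk: the script sets a flag whenever a tcpdump header line shows the client as the source (field 3 is the source address), then counts the next line containing "value number" as a payload packet from the client. A stdlib-only Java sketch of that same state machine — the sample lines below are illustrative, not real tcpdump output:

```java
import java.util.List;

public class CaptureCount {
  public static void main(String[] args) {
    List<String> lines = List.of(
        "12:00:01.000 IP 192.168.3.100.49152 > 192.168.3.21.50010: Flags [P.]",
        "E..value number 123...",
        "12:00:01.001 IP 192.168.3.21.50010 > 192.168.3.100.49152: Flags [.]",
        "ack only, no payload here");
    int count = 0;
    boolean fromClient = false;
    for (String line : lines) {
      String[] fields = line.split("\\s+");
      if (fields.length > 2 && fields[2].contains("192.168.3.100")) {
        fromClient = true;          // header line: packet sent by the client
      } else if (fromClient && line.contains("value number")) {
        count++;                    // payload following a client packet
        fromClient = false;
      }
    }
    System.out.println(count);      // 1
  }
}
```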

You should see traffic from the client machine to each of the datanodes on which you ran tcpdump. The name node only hands out block locations; the client then streams each block directly to the first data node in the write pipeline, which forwards it to the other replicas — so the bulk data never passes through the name node, and different blocks land on different data nodes.
