Getting started with MapReduce

MapReduce is a programming model built on the age-old concepts of functional programming.  The goals of functional programming can be simply described as:

  • every function returns a value
  • functions are “mapped” over lists of values

Map Reduce uses pieces of this methodology, defined as the following:

Map – This function transforms the input list, generating a value (typically a key/value pair) for each element in it

Reduce – This function aggregates the intermediate values into a result for the end user
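To make those two pieces concrete, here is a minimal Python sketch using the language's built-in map() and functools.reduce(); the word list is just an illustration:

from functools import reduce

words = ["the", "quick", "brown", "fox"]

# Map: transform each value in the list (here, each word to its length)
lengths = list(map(len, words))               # [3, 5, 5, 3]

# Reduce: aggregate the mapped values into a single result
total = reduce(lambda a, b: a + b, lengths)   # 16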

The canonical example used in MapReduce tutorials is word count.  Essentially, a “map” function takes a given document and builds a list of the words it contains.  This list is then passed to a reduce function, which produces a count of each word in the document.

Conceptually, this is the same as a Unix command line such as the following:

cat document.txt | awk '{for (i = 1; i <= NF; i++) print $i}' | sort | uniq -c

In the example above, our map function is the awk piece, while the sort and uniq calls represent the reduce function.

We could also write a Python script such as the following to do the same thing.

#!/usr/bin/env python

words = []

# "Map": split each line of the document into a list of words
with open("document.txt") as f:
  for line in f:
    for w in line.split():
      words.append(w)

# "Reduce": sort the words, then count each run of identical words
lastWord = None
count = 0

for word in sorted(words):
  if word == lastWord:
    count = count + 1
  else:
    if lastWord is not None:
      print(count, lastWord)
    lastWord = word
    count = 1

# Emit the count for the final word
if lastWord is not None:
  print(count, lastWord)

The difference between the two is that in the first case, the data is streamed from the map piece to the reduce piece as it is produced, rather than each stage being run sequentially to completion.  This can be much faster.
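In Python, that streaming behavior can be approximated with a generator, so the reduce stage consumes each word as the map stage produces it instead of materializing the whole word list first.  A minimal sketch, assuming the same document.txt as above:

from collections import Counter

def emit_words(path):
  # Map stage: yield one word at a time as the file is read
  with open(path) as f:
    for line in f:
      for w in line.split():
        yield w

# Reduce stage: aggregate counts as the words stream in
counts = Counter(emit_words("document.txt"))

for word, count in sorted(counts.items()):
  print(count, word)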

If we can partition our workload in some way, then we can also parallelize the work.  When we load our list of words, perhaps we can assign each word to one of several lists based on some rule.  For example, if the word is less than three characters long, it is placed in list A; if it is three or more characters long, we place it in list B.  We can then pass each list to its own reducer, and the reducers' results are combined at the end and passed back to our user, as sketched below.
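A rough sketch of that two-list scheme, using Python's multiprocessing module so the two reducers actually run in parallel (the length-based rule and document.txt are just the examples from above):

from collections import Counter
from multiprocessing import Pool

def partition(words):
  # Assign each word to list A or list B based on its length
  a, b = [], []
  for w in words:
    (a if len(w) < 3 else b).append(w)
  return a, b

if __name__ == "__main__":
  words = open("document.txt").read().split()
  list_a, list_b = partition(words)

  # Each reducer counts its own partition; the pool runs them in parallel
  with Pool(2) as pool:
    partial = pool.map(Counter, [list_a, list_b])

  # Combine the partial counts before handing the result back to the user
  combined = partial[0] + partial[1]
  print(combined.most_common(10))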

This is (using a very loose definition) how Google indexes the pages it crawls on the internet.  Using this methodology, they can index a *lot* of content very quickly.  That is where MapReduce is most (and perhaps only) valuable: extremely large datasets.

It can often be difficult to come up with a test case where MapReduce performs significantly better than its sequential equivalent on the same data.  This is because there is overhead in the handoff between the map and reduce components.  From an Oracle database perspective, it is the same issue that is sometimes seen with its parallel query execution feature: the query has to be one that would take some time to complete before the feature's usefulness becomes evident.
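One way to see that handoff cost yourself is to time a sequential count against a two-process version on the same data; on small inputs, the process startup and the pickling of the word lists usually swamp any gain.  A rough harness (the input inflation factor is arbitrary):

from collections import Counter
from multiprocessing import Pool
import time

def count_sequential(words):
  return Counter(words)

def count_parallel(words):
  # Split the list in half and count each half in a separate process
  mid = len(words) // 2
  with Pool(2) as pool:
    a, b = pool.map(Counter, [words[:mid], words[mid:]])
  return a + b

if __name__ == "__main__":
  words = open("document.txt").read().split() * 1000
  for fn in (count_sequential, count_parallel):
    start = time.perf_counter()
    fn(words)
    print(fn.__name__, time.perf_counter() - start)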

HADOOP

Hadoop is a set of Java classes that implements the MapReduce model.  It supplies a distributed filesystem, so each node holds the partition of data it is assigned to work on, and it handles job scheduling across the nodes.  The developer must simply provide the map and reduce classes to the Hadoop infrastructure, and these do not have to be written in Java.
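For example, with Hadoop Streaming, the map and reduce steps can be any executables that read standard input and write standard output, so the word-count job could be two small Python scripts along these lines (the tab-separated "word count" output format is the usual streaming convention):

#!/usr/bin/env python
# mapper.py: emit each word with a count of 1, one pair per line
import sys

for line in sys.stdin:
  for word in line.split():
    print("%s\t1" % word)

#!/usr/bin/env python
# reducer.py: Hadoop sorts the mapper output by key before it arrives here,
# so we can total each run of identical words, just as in the script above
import sys

current, total = None, 0
for line in sys.stdin:
  word, count = line.rstrip("\n").split("\t", 1)
  if word == current:
    total += int(count)
  else:
    if current is not None:
      print("%s\t%d" % (current, total))
    current, total = word, int(count)

if current is not None:
  print("%s\t%d" % (current, total))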
