{"id":1160,"date":"2011-04-23T08:00:57","date_gmt":"2011-04-23T13:00:57","guid":{"rendered":"http:\/\/appcrawler.com\/wordpress\/?p=1160"},"modified":"2011-07-06T09:28:36","modified_gmt":"2011-07-06T14:28:36","slug":"getting-started-with-map-reduce","status":"publish","type":"post","link":"http:\/\/appcrawler.com\/wordpress\/2011\/04\/23\/getting-started-with-map-reduce\/","title":{"rendered":"Getting started with map reduce"},"content":{"rendered":"<p>MapReduce is a programming model built on the age-old concepts of functional programming.\u00a0 The goals of functional programming can be simplistically described as:<\/p>\n<ul>\n<li>every program returns a value<\/li>\n<li>functions are \u201cmapped\u201d over lists of values<\/li>\n<\/ul>\n<p>MapReduce uses pieces of this methodology, defined as follows:<\/p>\n<p>Map \u2013 This function transforms or generates a value for each item in the input list<\/p>\n<p>Reduce \u2013 This function aggregates those values into a result for the end user<\/p>\n<p>The canonical example used in MapReduce tutorials is word count.\u00a0 Essentially, a \u201cmap\u201d function takes a given document and builds a list of the words in the document.\u00a0 This list is then passed to a reduce function, which generates another list containing the count of each word in the document.<\/p>\n<p>Conceptually, this is the same as a Unix command line such as the following:<\/p>\n<pre lang=\"awk\">\r\ncat document.txt | awk '{for (i = 1; i <= NF; i++) print $i}' | sort | uniq -c\r\n<\/pre>\n<p>In the example above, our map function is the awk piece, while the sort and uniq calls represent the reduce function.<\/p>\n<p>We could also write a Python script such as the following to do the same thing.<\/p>\n<pre lang=\"python\" line=\"1\">\r\n#!\/usr\/bin\/env python\r\n\r\nwords = []\r\n\r\nfor line in open(\"document.txt\"):\r\n  tmp = line.split()\r\n  for w in tmp:\r\n    words.append(w)\r\n\r\nsortedWords = sorted(words)\r\nlastWord = 
""\r\ncount = 0\r\n\r\nfor word in sortedWords:\r\n  if word == lastWord:\r\n    count = count + 1\r\n  else:\r\n    if lastWord:  # skip the initial empty value\r\n      print count, lastWord\r\n    lastWord = word\r\n    count = 1\r\n\r\n# flush the count for the final word\r\nif lastWord:\r\n  print count, lastWord\r\n<\/pre>\n<p>The difference between the two is that in the first case, the data is streamed from the map piece to the reduce piece, rather than each stage being run to completion sequentially.\u00a0 This can be much faster.<\/p>\n<p>If we can partition our workload in some way, then we can also parallelize the work.\u00a0 When we load our list of words, we can assign each word to a list based on some partitioning rule.\u00a0 For example, if the word is less than three characters long, it is placed in list A.\u00a0 If it is three or more characters long, we place it in list B.\u00a0 We can then pass each list to a reducer, which can work on its list in parallel with the others; the results are combined at the end and passed back to our user.<\/p>\n<p>This is (using a very loose definition) how Google indexes pages it crawls on the internet.\u00a0 Using this methodology, they can index a *lot* of content very quickly.\u00a0 That is where MapReduce is most (and perhaps only) valuable: extremely large datasets.<\/p>\n<p>It can often be difficult to come up with a test case where MapReduce performs significantly better than its sequential equivalent on the same data.\u00a0 This is because there is overhead in the handoff between the map and reduce components.\u00a0 From an Oracle database perspective, it is the same issue that is sometimes found when using its parallel query execution feature.\u00a0 The query has to be one that would take some time to complete before the feature's usefulness becomes evident.<\/p>\n<h1>HADOOP<\/h1>\n<p>Hadoop is a set of Java classes that implements the MapReduce model.\u00a0 It supplies a distributed filesystem, so that each node holds the partition of the data on which it is assigned to work, as well as the job scheduling between nodes.\u00a0 The developer must simply provide the map and reduce classes to the Hadoop infrastructure.\u00a0 These do not have to be written in Java.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>MapReduce is a programming model built on the age-old concepts of functional programming.\u00a0 The goals of functional programming can be simplistically described as: every program returns a value functions are \u201cmapped\u201d over lists of values Map&hellip;<\/p>\n<p class=\"more-link-p\"><a class=\"more-link\" href=\"http:\/\/appcrawler.com\/wordpress\/2011\/04\/23\/getting-started-with-map-reduce\/\">Read more &rarr;<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false,"footnotes":""},"categories":[19,21],"tags":[],"_links":{"self":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/1160"}],"collection":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/comments?post=1160"}],"version-history":[{"count":8,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/1160\/revisions"}],"predecessor-version":[{"id":1260,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/1160\/revisions\/1260"}],"wp:attachment":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/media?parent=1160"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/categories?post=1160"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/tags?post=1160"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}