{"id":4840,"date":"2015-04-10T07:52:30","date_gmt":"2015-04-10T12:52:30","guid":{"rendered":"http:\/\/appcrawler.com\/wordpress\/?p=4840"},"modified":"2015-04-10T08:10:30","modified_gmt":"2015-04-10T13:10:30","slug":"getting-additional-data-on-a-slow-hive-reducer","status":"publish","type":"post","link":"http:\/\/appcrawler.com\/wordpress\/2015\/04\/10\/getting-additional-data-on-a-slow-hive-reducer\/","title":{"rendered":"Getting additional data on a slow hive reducer"},"content":{"rendered":"<p>We have a query that has been running for almost two hours on the last of 1,009 reducers in the first reduce phase.  <\/p>\n<p><img alt='' class='alignnone size-full wp-image-4846 ' src='http:\/\/appcrawler.com\/wordpress\/wp-content\/uploads\/2015\/04\/img_5527c7035030d.png' \/><\/p>\n<p>We wanted to gain additional insight into what it was doing, so we first find the node on which the reducer is running.  There are probably more elegant ways to do this, but this works.  We base our guess on the node using the most CPU.<\/p>\n<pre>\r\n[root@cmhlpdlkedat15 hive]# for i in {01..12}; do ssh cmhlpdlkedat${i} sar -s 08:00:00 | grep Average; done\r\nAverage:        all     53.15      0.00      1.49      0.03      0.00     45.34\r\nAverage:        all      1.68      0.00      1.07      0.05      0.00     97.20\r\nAverage:        all      1.60      0.00      0.94      0.05      0.00     97.41\r\nAverage:        all      1.68      0.00      1.00      0.05      0.00     97.27\r\nAverage:        all      1.33      0.00      0.82      0.04      0.00     97.80\r\nAverage:        all      1.66      0.00      0.94      0.05      0.00     97.35\r\nAverage:        all      1.79      0.00      0.90      0.05      0.00     97.27\r\nAverage:        all      2.89      0.00      1.29      0.06      0.00     95.76\r\nAverage:        all      1.82      0.00      1.07      0.05      0.00     97.06\r\nAverage:        all      1.82      0.00      0.93      0.71      0.00     96.54\r\nAverage:        all      1.48      0.00      0.99      0.04      0.00     97.50\r\nAverage:        all      1.56      0.00      0.86      0.04      0.00     97.54\r\n[root@cmhlpdlkedat15 hive]#\r\n<\/pre>\n<p>We then log into that server, and find the thread using the most CPU&#8230;<\/p>\n<pre>\r\n[root@cmhlpdlkedat01 ~]# ps -eLf | awk '{print $2,$4,$9}' | sort -k +3 | tail -5;sleep 5;ps -eLf | awk '{print $2,$4,$9}' | sort -k +3 | tail -5\r\n1795 1862 00:36:28\r\n13649 13656 00:59:33\r\n1795 1808 01:02:08\r\n3393 3412 01:18:38\r\nPID LWP TIME\r\n1795 1862 00:36:28\r\n13649 13656 00:59:33\r\n1795 1808 01:02:08\r\n3393 3412 01:18:42\r\nPID LWP TIME\r\n<\/pre>\n<p>After five seconds, we see it is PID 3412&#8230;<\/p>\n<pre>\r\n[root@cmhlpdlkedat01 ~]# ps -eLf | grep 3412\r\nyarn      3393  3380  3412 79   26 06:58 ?        01:18:48 \/usr\/jdk64\/jdk1.7.0_45\/bin\/java -server -Djava.net.preferIPv4Stack=true -Dhdp.version=2.2.0.0-2041 -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -server -Xmx410m -Djava.net.preferIPv4Stack=true -XX:NewRatio=8 -XX:+UseNUMA -XX:+UseParallelGC -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=\/mnt\/hdfs\/hadoop\/yarn\/log\/application_1428633698169_0025\/container_1428633698169_0025_01_000004 -Dtez.root.logger=INFO,CLA -Djava.io.tmpdir=\/mnt\/hdfs\/hadoop\/yarn\/local\/usercache\/root\/appcache\/application_1428633698169_0025\/container_1428633698169_0025_01_000004\/tmp org.apache.tez.runtime.task.TezChild 172.27.2.57 43015 container_1428633698169_0025_01_000004 application_1428633698169_0025 1\r\nroot     27426 26643 27426  0    1 08:37 pts\/0    00:00:00 grep 3412\r\n<\/pre>\n<p>We then trace the system calls on this PID, looking only for read calls&#8230;<\/p>\n<pre>\r\n[root@cmhlpdlkedat01 ~]# strace -p 3412 -o t.txt -e trace=read\r\n<\/pre>\n<p>We see that file descriptor 2231 is being read, and we can even see dates in the input (2015-03-24)&#8230;<\/p>\n<pre>\r\n[root@cmhlpdlkedat01 ~]# head -20 t.txt\r\nread(2231, \"2015-03-24\\0\\0\\0\\27\\0\\0\\0\\4\\0\\0\\0\\0\\0\\0\\0\\17\\3\\0\\1\\0\\n2\"..., 131072) = 131072\r\nread(2232, \"\\340&\\17\\350\\243\\240\\371\\271\\352d\\217\\7\\373\\245zLW$z\\25\\263m\\207\\313\\226u\\342\\16\\\\\\242\\377s\"..., 4096) = 4096\r\nread(2231, \"\\4\\0\\0\\0\\0\\0\\0\\0\\20\\3\\2\\2\\24\\f\\n2014-12-03\\0\\0\\0\\30\\0\\0\\0\"..., 131072) = 131072\r\nread(2231, \"15-03-11\\0\\0\\0\\30\\0\\0\\0\\4\\0\\0\\0\\0\\0\\0\\0\\20\\3\\2\\2\\31\\242\\n20\"..., 131072) = 131072\r\n<\/pre>\n<p>&#8230;so we find the actual filename for this descriptor, and tail it to see it is the larger of the two tables&#8230;<\/p>\n<pre>\r\n[root@cmhlpdlkedat01 ~]# ls -lrt \/proc\/3412\/fd | grep 2231\r\nlrwx------ 1 yarn hadoop 64 Apr 10 08:32 429 -> socket:[34622312]\r\nlr-x------ 1 yarn hadoop 64 Apr 10 08:47 2231 -> \/mnt\/hdfs\/hadoop\/yarn\/local\/usercache\/root\/appcache\/application_1428633698169_0025\/container_1428633698169_0025_01_000004\/tmp\/hive-rowcontainer7779433895197233164\/RowContainer4530514485187724697.tmp\r\n[root@cmhlpdlkedat01 ~]# tail -f \/mnt\/hdfs\/hadoop\/yarn\/local\/usercache\/root\/appcache\/application_1428633698169_0025\/container_1428633698169_0025_01_000004\/tmp\/hive-rowcontainer7779433895197233164\/RowContainer4530514485187724697.tmp\r\n2014-05-16T\r\n2014-05-16\u00a6\r\n2014-05-16\u00a6?\r\n2014-03-13\u00a6\r\n2014-03-13\r\n2014-03-13\u00a6\r\n2014-03-13V\r\n2014-03-13\r\n2014-03-13\r\n          q\r\n2014-03-13\r\n<\/pre>\n<p>We then decide to get a thread dump for the JVM.  First of all, notice the name of the JVM is TezChild, which is a tip off that you are runnbing hive code on a given server.  Again, there are other ways, this is just one of them.<\/p>\n<pre>\r\n[root@cmhlpdlkedat01 yarn]# su - yarn\r\n[yarn@cmhlpdlkedat01 ~]$ jps\r\n16082 NodeManager\r\n663 Jps\r\n3393 TezChild\r\n[yarn@cmhlpdlkedat01 ~]$ printf \"%x\\n\" 3412\r\nd54\r\n[yarn@cmhlpdlkedat01 ~]$ jstack 3393 | awk '{if ($0 ~ \"0xd54\") {i = 1;print $0} else if (i == 1 && $0 != \"\") {print $0} else if (i == 1 && $0 == \"\") {exit}}'\r\n\"TezChild\" daemon prio=10 tid=0x00007f5e886c8000 nid=0xd54 runnable [0x00007f5e7d17c000]\r\n   java.lang.Thread.State: RUNNABLE\r\n        at java.io.DataInputStream.readFully(DataInputStream.java:195)\r\n        at org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:180)\r\n        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)\r\n        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)\r\n        at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2245)\r\n        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2218)\r\n        - locked <0x00000000f68a80f0> (a org.apache.hadoop.io.SequenceFile$Reader)\r\n        at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:109)\r\n        - locked <0x00000000f68a80c8> (a org.apache.hadoop.mapred.SequenceFileRecordReader)\r\n        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:84)\r\n        - locked <0x00000000f68a80c8> (a org.apache.hadoop.mapred.SequenceFileRecordReader)\r\n        at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.nextBlock(RowContainer.java:360)\r\n        at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.next(RowContainer.java:267)\r\n        at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.next(RowContainer.java:74)\r\n        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:644)\r\n        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:654)\r\n        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:750)\r\n        at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinObject(CommonMergeJoinOperator.java:248)\r\n        at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:213)\r\n        at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.processOp(CommonMergeJoinOperator.java:196)\r\n        at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:328)\r\n        at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:258)\r\n        at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:168)\r\n        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:163)\r\n        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)\r\n        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)\r\n        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:176)\r\n        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)\r\n        at java.security.AccessController.doPrivileged(Native Method)\r\n        at javax.security.auth.Subject.doAs(Subject.java:415)\r\n        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)\r\n        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)\r\n        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)\r\n        at java.util.concurrent.FutureTask.run(FutureTask.java:262)\r\n        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\r\n        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\r\n        at java.lang.Thread.run(Thread.java:744)\r\n[yarn@cmhlpdlkedat01 ~]$\r\n<\/pre>\n<p>If we again run the thread dump, we see a different call stack, with very interesting names for the classes and methods&#8230;<\/p>\n<pre>\r\n[yarn@cmhlpdlkedat01 ~]$ jstack 3393 | awk '{if ($0 ~ \"0xd54\") {i = 1;print $0} else if (i == 1 && $0 != \"\") {print $0} else if (i == 1 && $0 == \"\") {exit}}'\r\n\"TezChild\" daemon prio=10 tid=0x00007f5e886c8000 nid=0xd54 runnable [0x00007f5e7d17b000]\r\n   java.lang.Thread.State: RUNNABLE\r\n        at sun.nio.cs.UTF_8.updatePositions(UTF_8.java:78)\r\n        at sun.nio.cs.UTF_8$Encoder.encodeArrayLoop(UTF_8.java:564)\r\n        at sun.nio.cs.UTF_8$Encoder.encodeLoop(UTF_8.java:619)\r\n        at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:561)\r\n        at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:783)\r\n        at org.apache.hadoop.io.Text.encode(Text.java:450)\r\n        at org.apache.hadoop.io.Text.set(Text.java:198)\r\n        at org.apache.hadoop.hive.ql.udf.UDFSubstr.evaluate(UDFSubstr.java:78)\r\n        at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)\r\n        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\r\n        at java.lang.reflect.Method.invoke(Method.java:606)\r\n        at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:1235)\r\n        at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.evaluate(GenericUDFBridge.java:182)\r\n        at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:185)\r\n        at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)\r\n        at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)\r\n        at org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.getNewKey(KeyWrapperFactory.java:113)\r\n        at org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(GroupByOperator.java:784)\r\n        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)\r\n        at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)\r\n        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)\r\n        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:638)\r\n        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:651)\r\n        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:654)\r\n        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:750)\r\n        at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinObject(CommonMergeJoinOperator.java:248)\r\n        at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:213)\r\n        at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.processOp(CommonMergeJoinOperator.java:196)\r\n        at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:328)\r\n        at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:258)\r\n        at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:168)\r\n        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:163)\r\n        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)\r\n        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)\r\n        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:176)\r\n        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)\r\n        at java.security.AccessController.doPrivileged(Native Method)\r\n        at javax.security.auth.Subject.doAs(Subject.java:415)\r\n        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)\r\n        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)\r\n        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)\r\n        at java.util.concurrent.FutureTask.run(FutureTask.java:262)\r\n        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\r\n        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\r\n        at java.lang.Thread.run(Thread.java:744)\r\n[yarn@cmhlpdlkedat01 ~]$\r\n<\/pre>\n<p>Not an answer, but it is additional info.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We have a query that has been running for almost two hours on the last of 1,009 reducers in the first reduce phase. We wanted to gain additional insight into what it was doing, so we first find the node&hellip;<\/p>\n<p class=\"more-link-p\"><a class=\"more-link\" href=\"http:\/\/appcrawler.com\/wordpress\/2015\/04\/10\/getting-additional-data-on-a-slow-hive-reducer\/\">Read more &rarr;<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false,"footnotes":""},"categories":[19,21,43],"tags":[],"_links":{"self":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/4840"}],"collection":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/comments?post=4840"}],"version-history":[{"count":9,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/4840\/revisions"}],"predecessor-version":[{"id":4850,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/4840\/revisions\/4850"}],"wp:attachment":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/media?parent=4840"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/categories?post=4840"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/tags?post=4840"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}