{"id":5451,"date":"2016-08-23T13:10:15","date_gmt":"2016-08-23T18:10:15","guid":{"rendered":"http:\/\/appcrawler.com\/wordpress\/?p=5451"},"modified":"2016-08-23T13:10:15","modified_gmt":"2016-08-23T18:10:15","slug":"crc-check-on-gzip-files-in-hdfs","status":"publish","type":"post","link":"http:\/\/appcrawler.com\/wordpress\/2016\/08\/23\/crc-check-on-gzip-files-in-hdfs\/","title":{"rendered":"CRC check on gzip files in HDFS"},"content":{"rendered":"<p>I am sure there is a more elegant way to do this, but I wanted it done quickly. A few of the gzip files underlying an external Hive table threw exceptions about the end of the file being reached (CRC failure). There is a fix coming out for this at some point, but for now, this is a workaround to at least identify those problem children. The program below lists the files in the directory passed as its first argument and has 20 threads read each one from start to finish, so a bad file surfaces as an exception.<\/p>\n<pre>\r\nimport java.util.concurrent.*;\r\nimport java.io.*;\r\nimport org.apache.hadoop.conf.*;\r\nimport org.apache.hadoop.fs.*;\r\nimport org.apache.hadoop.io.compress.*;\r\n\r\npublic class check implements Runnable {\r\n  \/\/ Bounded queue shared by the producer (main) and the worker threads\r\n  static ArrayBlockingQueue&lt;Object&gt; queue = new ArrayBlockingQueue&lt;Object&gt;(1000);\r\n  static FileSystem fs;\r\n  static CompressionCodecFactory factory;\r\n  \/\/ Sentinel that tells the workers there are no more files\r\n  static Object POISON_PILL = new Object();\r\n  public static void main(String args[]) throws Exception {\r\n    Configuration conf = new Configuration();\r\n    fs = FileSystem.get(conf);\r\n    factory = new CompressionCodecFactory(conf);\r\n    try {\r\n      Path file = new Path(args[0]);\r\n      RemoteIterator&lt;LocatedFileStatus&gt; it = fs.listFiles(file,false);\r\n      String fname = \"\";\r\n      \/\/ Each check starts its own worker thread in its constructor\r\n      for (int j = 1; j &lt;= 20; j++) {\r\n        new check();\r\n      }\r\n      while (it.hasNext()) {\r\n        try {\r\n          LocatedFileStatus item = it.next();\r\n          fname = item.getPath().toString();\r\n          queue.put(item);\r\n          System.out.println(\"put \" + fname);\r\n        
}\r\n        catch (Exception e) {\r\n          System.out.println(fname + \" \" + e.getMessage());\r\n        }\r\n      }\r\n      queue.put(POISON_PILL);\r\n      \/\/ Do not close fs here: the workers are still reading from the shared instance\r\n    }\r\n    catch (Exception ezip) {\r\n      ezip.printStackTrace();\r\n    }\r\n  }\r\n\r\n  check() {\r\n    \/\/ Each instance runs on its own thread\r\n    new Thread(this).start();\r\n  }\r\n\r\n  public void run() {\r\n    try {\r\n      while (true) {\r\n        Object obj = check.queue.take();\r\n        if (obj == check.POISON_PILL) {\r\n          \/\/ Put the pill back so the other workers also stop\r\n          check.queue.put(POISON_PILL);\r\n          break;\r\n        }\r\n        Path path = ((LocatedFileStatus)obj).getPath();\r\n        System.out.println(\"processing \" + path);\r\n        CompressionCodec codec = check.factory.getCodec(path);\r\n\r\n        \/\/ Handle failures per file so one bad file does not stop this worker\r\n        try (InputStream raw = check.fs.open(path);\r\n             InputStream stream = (codec != null) ? codec.createInputStream(raw) : raw) {\r\n          \/\/ Read to the end; a truncated or corrupt file throws here\r\n          byte[] buf = new byte[65536];\r\n          while (stream.read(buf) != -1) {\r\n          }\r\n        }\r\n        catch (Exception e) {\r\n          \/\/ Report the problem file by name and move on to the next one\r\n          System.out.println(\"FAILED \" + path + \" \" + e.getMessage());\r\n        }\r\n      }\r\n    }\r\n    catch (Exception e) {\r\n      e.printStackTrace();\r\n    }\r\n  }\r\n}\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>I am sure there is a more elegant way to do this, but I wanted it done quickly. 
We had a few files that threw exceptions about the end of the file being reached (CRC failure) for external files underlying&hellip;<\/p>\n<p class=\"more-link-p\"><a class=\"more-link\" href=\"http:\/\/appcrawler.com\/wordpress\/2016\/08\/23\/crc-check-on-gzip-files-in-hdfs\/\">Read more &rarr;<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false,"footnotes":""},"categories":[17,7],"tags":[],"_links":{"self":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/5451"}],"collection":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/comments?post=5451"}],"version-history":[{"count":1,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/5451\/revisions"}],"predecessor-version":[{"id":5453,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/5451\/revisions\/5453"}],"wp:attachment":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/media?parent=5451"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/categories?post=5451"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/tags?post=5451"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}