{"id":1024,"date":"2011-01-26T13:49:29","date_gmt":"2011-01-26T18:49:29","guid":{"rendered":"http:\/\/appcrawler.com\/wordpress\/?p=1024"},"modified":"2011-07-06T09:42:36","modified_gmt":"2011-07-06T14:42:36","slug":"gzip-in-python","status":"publish","type":"post","link":"http:\/\/appcrawler.com\/wordpress\/2011\/01\/26\/gzip-in-python\/","title":{"rendered":"Gzip in python"},"content":{"rendered":"<p>We have a large text file containing 325 million integer values that we need to load into a database.  The file uncompressed is about 4.5GB.  Even though disk is cheap, we didn&#8217;t have enough readily available to quickly decompress the file.<\/p>\n<p>As such, I used the python gzip module to decompress the source file on the fly and read it line by line while using cx_Oracle to load the rows.  We batch the inserts into 5,000 row chunks to take advantage of bulk loading.<\/p>\n<p>See below:<\/p>\n<pre lang=\"python\" line=\"1\">\r\n#!\/home\/oracle\/local\/bin\/python\r\n\r\nimport gzip, os, cx_Oracle, sys, string\r\n\r\nconn = cx_Oracle.connect(\"user\",\"password\")\r\n\r\nrows = []\r\nloops = 0\r\n\r\ncursor = conn.cursor()\r\ncursor.prepare(\"insert into oclc_num_seq(xwcmd_id) values(:1)\")\r\n\r\nfd = gzip.open(os.path.join(\"\/u02\", \"missing_find.gz\"))\r\nfor line in fd:\r\n  tmp = line.strip()\r\n  try:\r\n    rows.append([int(tmp)])\r\n  except:\r\n    print str(sys.exc_info()[1])\r\n  loops = loops + 1\r\n  if loops % 5000 == 0:\r\n    cursor.executemany(None, rows)\r\n    conn.commit()\r\n    del rows[:]\r\n    print loops\r\n    sys.stdout.flush()\r\n\r\n#run one last execution to ensure we got the \"stragglers\" less than 5,000\r\ncursor.executemany(None, rows)\r\nconn.commit()\r\ndel rows[:]\r\nprint loops\r\nsys.stdout.flush()\r\n<\/pre>\n<p>We can load about one million rows per minute doing this.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We have a large text file containing 325 million integer values that we need to load into a database. The file uncompressed is about 4.5GB. Even though disk is cheap, we didn&#8217;t have enough readily available to quickly decompress the&hellip;<\/p>\n<p class=\"more-link-p\"><a class=\"more-link\" href=\"http:\/\/appcrawler.com\/wordpress\/2011\/01\/26\/gzip-in-python\/\">Read more &rarr;<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false,"footnotes":""},"categories":[19,24,22,26],"tags":[],"_links":{"self":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/1024"}],"collection":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/comments?post=1024"}],"version-history":[{"count":8,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/1024\/revisions"}],"predecessor-version":[{"id":1284,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/1024\/revisions\/1284"}],"wp:attachment":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/media?parent=1024"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/categories?post=1024"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/tags?post=1024"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}