Gzip in Python

We have a large text file containing 325 million integer values that we need to load into a database. Uncompressed, the file is about 4.5 GB. Even though disk is cheap, we didn't have enough space readily available to decompress the file first.

As such, I used the Python gzip module to decompress the source file on the fly, reading it line by line while using cx_Oracle to load the rows. The inserts are batched into 5,000-row chunks to take advantage of bulk loading.

See below:

#!/home/oracle/local/bin/python

import gzip, os, sys, cx_Oracle

conn = cx_Oracle.connect("user", "password")

rows = []
loops = 0

cursor = conn.cursor()
cursor.prepare("insert into oclc_num_seq(xwcmd_id) values(:1)")

# Decompress on the fly: gzip.open yields one line at a time, so the
# 4.5 GB of uncompressed data never has to hit the disk.
fd = gzip.open(os.path.join("/u02", "missing_find.gz"))
for line in fd:
    tmp = line.strip()
    try:
        rows.append([int(tmp)])
    except ValueError as e:
        # Report (but skip) any line that isn't a valid integer.
        print e
    loops += 1
    if loops % 5000 == 0:
        # Bulk-insert the batch using the prepared statement.
        cursor.executemany(None, rows)
        conn.commit()
        del rows[:]
        print loops
        sys.stdout.flush()

# Run one last execution to ensure we get the "stragglers" -- the final
# partial batch of fewer than 5,000 rows.
if rows:
    cursor.executemany(None, rows)
    conn.commit()
    del rows[:]
print loops
sys.stdout.flush()

fd.close()
cursor.close()
conn.close()

With this approach, we can load about one million rows per minute.
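If you're doing this today, the same streaming pattern carries over to Python 3 and python-oracledb (the renamed cx_Oracle driver). Here's a minimal sketch; the DSN, table, and file path are just placeholders carried over from the script above:

#!/usr/bin/env python3

# A sketch of the same streaming bulk load on Python 3 with
# python-oracledb. Connection details, table, and file names are
# placeholders, not a recommendation for any particular environment.
import gzip
import oracledb

BATCH_SIZE = 5000
SQL = "insert into oclc_num_seq(xwcmd_id) values(:1)"

with oracledb.connect(user="user", password="password", dsn="localhost/orclpdb") as conn:
    with conn.cursor() as cursor:
        rows = []
        loaded = 0
        # "rt" decodes the decompressed stream to text, line by line.
        with gzip.open("/u02/missing_find.gz", "rt") as fd:
            for line in fd:
                try:
                    rows.append([int(line.strip())])
                except ValueError as exc:
                    print(exc)
                if len(rows) == BATCH_SIZE:
                    cursor.executemany(SQL, rows)
                    conn.commit()
                    loaded += len(rows)
                    rows.clear()
        if rows:  # final partial batch
            cursor.executemany(SQL, rows)
            conn.commit()
            loaded += len(rows)
        print(loaded, "rows loaded")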
