We have a large text file containing 325 million integer values that we need to load into a database. Uncompressed, the file is about 4.5 GB; even though disk is cheap, we didn't have enough readily available to decompress it quickly.
So I used the Python gzip module to decompress the source file on the fly, reading it line by line while using cx_Oracle to load the rows. The inserts are batched into 5,000-row chunks to take advantage of bulk loading.
See below:
#!/home/oracle/local/bin/python
import gzip, os, cx_Oracle, sys

conn = cx_Oracle.connect("user", "password")
cursor = conn.cursor()
cursor.prepare("insert into oclc_num_seq(xwcmd_id) values(:1)")

rows = []
loops = 0
fd = gzip.open(os.path.join("/u02", "missing_find.gz"))
for line in fd:
    tmp = line.strip()
    try:
        rows.append([int(tmp)])
    except ValueError:
        # report the malformed line and keep going
        print str(sys.exc_info()[1])
    loops = loops + 1
    if loops % 5000 == 0:
        cursor.executemany(None, rows)
        conn.commit()
        del rows[:]
        print loops
        sys.stdout.flush()

# run one last execution to ensure we got the "stragglers" --
# the final partial chunk of fewer than 5,000 rows
if rows:
    cursor.executemany(None, rows)
    conn.commit()
    del rows[:]
print loops
sys.stdout.flush()
fd.close()
We can load about one million rows per minute this way, so the full 325 million rows take roughly five and a half hours.
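For readers without an Oracle instance handy, the same chunked-executemany pattern can be sketched with nothing but the standard library; here sqlite3 stands in for cx_Oracle, and the table and column names are hypothetical, not the ones from the script above.

```python
import gzip
import io
import sqlite3


def load_chunked(fd, conn, chunk_size=5000):
    """Insert one integer per line from fd, committing chunk_size rows at a time."""
    cursor = conn.cursor()
    rows = []
    total = 0
    for line in fd:
        tmp = line.strip()
        if not tmp:
            continue
        try:
            rows.append((int(tmp),))
        except ValueError:
            continue  # skip malformed lines
        if len(rows) == chunk_size:
            cursor.executemany("insert into num_seq(id) values(?)", rows)
            conn.commit()
            total += len(rows)
            del rows[:]
    # pick up the stragglers: the final partial chunk
    if rows:
        cursor.executemany("insert into num_seq(id) values(?)", rows)
        conn.commit()
        total += len(rows)
    return total


# Usage: an in-memory gzip stream and database stand in for the real file and Oracle.
data = gzip.compress(b"\n".join(str(i).encode() for i in range(12345)))
conn = sqlite3.connect(":memory:")
conn.execute("create table num_seq(id integer)")
fd = gzip.open(io.BytesIO(data), "rt")
print(load_chunked(fd, conn))  # 12345
```

Counting rows in the buffer (rather than lines read) keeps every batch at exactly chunk_size even when some lines fail to parse.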