I use the following idiom over and over: I read from a large binary file (sometimes up to 1.2 million records!) and store the output in an SQLite database. Putting stuff into the SQLite database seems to be fairly fast.
import os
import struct
import time


def readerFunction(recordSize, recordFormat, connection, outputDirectory, outputFile, numberOfObjects):
    insertString = "insert into NODE_DISP_INFO(node, analysis, timeStep, H1_translation, H2_translation, V_translation, H1_rotation, H2_rotation, V_rotation) values (?, ?, ?, ?, ?, ?, ?, ?, ?)"
    # The output file name is assumed to end in a three-digit analysis number.
    analysisNumber = int(outputFile[-3:])
    outputFileObject = open(os.path.join(outputDirectory, outputFile), "rb")
    outputFileObject, numberOfRecordsInFileObject = determineNumberOfRecordsInFileObjectGivenRecordSize(recordSize, outputFileObject)
    numberOfRecordsPerObject = numberOfRecordsInFileObject // numberOfObjects
    loop1StartTime = time.time()
    for i in range(numberOfRecordsPerObject):
        processedRecords = []
        loop2StartTime = time.time()
        # Loop 2: read and unpack one fixed-size record per object for this time step.
        for j in range(numberOfObjects):
            fout = outputFileObject.read(recordSize)
            processedRecords.append((j + 1, analysisNumber, i) + struct.unpack(recordFormat, fout))
        loop2EndTime = time.time()
        print "Time taken to finish loop2: {}".format(loop2EndTime - loop2StartTime)
        dbInsertStartTime = time.time()
        connection.executemany(insertString, processedRecords)
        dbInsertEndTime = time.time()
        print "Time taken to insert into the database: {}".format(dbInsertEndTime - dbInsertStartTime)
    loop1EndTime = time.time()
    print "Time taken to finish loop1: {}".format(loop1EndTime - loop1StartTime)
    outputFileObject.close()
    print "Finished reading output file for analysis {}...".format(analysisNumber)
When I run the code, "loop 2" and the database insert are where most of the execution time goes. The average "loop 2" time is 0.003s, but it runs up to 50,000 times in some analyses, and the database insert takes about the same: 0.004s per batch. Currently, I insert into the database after every pass through loop 2 so that I don't have to worry about running out of RAM.
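One restructuring I have been considering (a rough, unbenchmarked sketch; the names mirror those in readerFunction above) is to cut the per-record overhead in loop 2: precompile the format with struct.Struct, read a whole time step's worth of records in one call, and slice the individual records out with unpack_from:

recordStruct = struct.Struct(recordFormat)  # precompiled, so the format string is parsed once
blockSize = recordSize * numberOfObjects    # one time step's worth of records

for i in range(numberOfRecordsPerObject):
    # One read() per time step instead of one read() per record.
    block = outputFileObject.read(blockSize)
    processedRecords = [(j + 1, analysisNumber, i) + recordStruct.unpack_from(block, j * recordSize)
                        for j in range(numberOfObjects)]
    connection.executemany(insertString, processedRecords)

This assumes the records for one time step are laid out contiguously in the file, which is exactly how the inner loop above reads them. I don't know how much of the cost is the read() calls versus the unpack() calls, though.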
What could I do to speed up "loop 2"?