Dear everyone,
I have successfully debugged my own memory leak problems. However, I have noticed some very strange occurence.
for fid, fv in freqDic.iteritems():
outf.write(fid+"\t") #ID
for i, term in enumerate(domain): #Vector
tfidf = self.tf(term, fv) * self.idf( term, docFreqDic)
if i == len(domain) - 1:
outf.write("%f\n" % tfidf)
else:
outf.write("%f\t" % tfidf)
outf.flush()
print "Memory increased by", int(self.memory_mon.usage()) - startMemory
outf.close()
def tf(self, term, freqVector):
total = freqVector[TOTAL]
if total == 0:
return 0
if term not in freqVector: ## When you don't have these lines memory leaks occurs
return 0 ##
return float(freqVector[term]) / freqVector[TOTAL]
def idf(self, term, docFrequencyPerTerm):
if term not in docFrequencyPerTerm:
return 0
return math.log( float(docFrequencyPerTerm[TOTAL])/docFrequencyPerTerm[term])
Basically let me describe my problem:
1) I am doing tfidf calculations
2) I traced that the source of memory leaks is coming from defaultdict.
3) I am using the memory_mon from http://stackoverflow.com/questions/276052/how-to-get-current-cpu-and-ram-usage-in-python
4) The reason for my memory leaks is as follows: a) in self.tf, if the lines: if term not in freqVector: return 0 are not added that will cause the memory leak. (I verified this myself using memory_mon and noticed a sharp increase in memory that kept on increasing)
The solution to my problem was 1) since fv is a defaultdict, any reference to it that are not found in fv will create an entry. Over a very large domain, this will cause memory leaks.
I decided to use dict instead of default dict and the memory problem did go away.
My only puzzle is: since fv is created in "for fid, fv in freqDic.iteritems():" shouldn't fv be destroyed at the end of every for loop? I tried putting gc.collect() at the end of the for loop but gc was not able to collect everything (returns 0). Yes, the hypothesis is right, but the memory should stay fairly consistent with ever for loop if for loops do destroy all temp variables.
This is what it looks like with that two line in self.tf:
Memory increased by 12
Memory increased by 948
Memory increased by 28
Memory increased by 36
Memory increased by 36
Memory increased by 32
Memory increased by 28
Memory increased by 32
Memory increased by 32
Memory increased by 32
Memory increased by 40
Memory increased by 32
Memory increased by 32
Memory increased by 28
and without the the two line:
Memory increased by 1652
Memory increased by 3576
Memory increased by 4220
Memory increased by 5760
Memory increased by 7296
Memory increased by 8840
Memory increased by 10456
Memory increased by 12824
Memory increased by 13460
Memory increased by 15000
Memory increased by 17448
Memory increased by 18084
Memory increased by 19628
Memory increased by 22080
Memory increased by 22708
Memory increased by 24248
Memory increased by 26704
Memory increased by 27332
Memory increased by 28864
Memory increased by 30404
Memory increased by 32856
Memory increased by 33552
Memory increased by 35024
Memory increased by 36564
Memory increased by 39016
Memory increased by 39924
Memory increased by 42104
Memory increased by 42724
Memory increased by 44268
Memory increased by 46720
Memory increased by 47352
Memory increased by 48952
Memory increased by 50428
Memory increased by 51964
Memory increased by 53508
Memory increased by 55960
Memory increased by 56584
Memory increased by 58404
Memory increased by 59668
Memory increased by 61208
Memory increased by 62744
Memory increased by 64400
I look forward to your answer