Document Similarity: Comparing two documents efficiently
Posted by seanieb on Stack Overflow on 2010-03-13.
I have a loop that calculates the similarity between two documents. It collects all the tokens in a document along with their tf-idf scores and puts them in a dictionary, then compares the two dictionaries: the similarity is the sum, over the tokens that appear in both documents, of the products of their scores (i.e. the dot product of the two tf-idf vectors).
Here is what I have so far. It works, but it's very slow:
# Doc A: pull every (token, tfidf_norm) pair for the first document
cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0],))
doca = cursor1.fetchall()
# convert the list of row tuples to a dictionary mapping token -> score
doca_dic = dict((row[0], row[1]) for row in doca)
# Doc B: same query for the second document
cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0],))
docb = cursor2.fetchall()
# convert the list of row tuples to a dictionary mapping token -> score
docb_dic = dict((row[0], row[1]) for row in docb)
# loop through each token in doc A and check whether it also occurs in doc B
similarity = 0  # reset the score for this document pair
for x in doca_dic:
    if x in docb_dic:
        # sum the products of the tf-idf scores of the shared tokens
        similarity += doca_dic[x] * docb_dic[x]
print "similarity:", similarity
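Since the inner loop is really just a dot product over the tokens the two documents share, I think it collapses to a single expression (Python 2, same dictionaries as above):

similarity = sum(doca_dic[t] * docb_dic[t] for t in doca_dic if t in docb_dic)

One other idea I've been toying with, though I haven't tested it, is to let the database do the intersection and the sum in a single self-join on the same index table, so only one number comes back per document pair:

# untested sketch: a and b are two aliases of the index table, joined on
# token, so only tokens present in both documents survive the join
cursor1.execute("""SELECT SUM(a.tfidf_norm * b.tfidf_norm)
    FROM index a JOIN index b ON a.token = b.token
    WHERE a.doc_id = %s AND b.doc_id = %s""", (docid[i][0], docid[j][0]))
similarity = cursor1.fetchone()[0] or 0  # SUM is NULL when no tokens overlap

I have no idea whether either of these is actually faster, though.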
I'm pretty new to Python, hence the mess. I need to speed this up; any help would be appreciated. Thanks.