Document Similarity: Comparing two documents efficiently

I have a loop that calculates the similarity between two documents. It collects all the tokens in a document together with their scores and places them in a dictionary, then compares the two dictionaries token by token.

This is what I have so far. It works, but it's super slow:

# Doc A
# note: INDEX is a reserved word in MySQL, so the table name needs backticks
cursor1.execute("SELECT token, tfidf_norm FROM `index` WHERE doc_id = %s",
                (docid[i][0],))  # trailing comma makes the parameters a one-element tuple
doca = cursor1.fetchall()
# convert the result rows to a dictionary
doca_dic = dict((row[0], row[1]) for row in doca)

# Doc B
cursor2.execute("SELECT token, tfidf_norm FROM `index` WHERE doc_id = %s",
                (docid[j][0],))
docb = cursor2.fetchall()
# convert the result rows to a dictionary
docb_dic = dict((row[0], row[1]) for row in docb)

# loop through each token in doc A and check whether it also appears in doc B
similarity = 0.0
for x in doca_dic:
    if x in docb_dic:  # "in" replaces the deprecated has_key()
        # the similarity is the sum of the products of the tf-idf norms
        similarity += doca_dic[x] * docb_dic[x]
print "similarity:", similarity
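
One Python-level speedup is to iterate over the smaller of the two dictionaries, so the loop runs over fewer tokens. A minimal sketch, assuming doca_dic and docb_dic are built as above:

# iterate over the smaller dictionary and probe the larger one;
# sum() over a generator replaces the explicit accumulation loop
small, large = sorted((doca_dic, docb_dic), key=len)
similarity = sum(score * large[token]
                 for token, score in small.iteritems()
                 if token in large)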

I'm pretty new to Python, hence this mess. I need to speed it up; any help would be appreciated. Thanks.
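
For reference, a bigger win is usually to push the whole dot product into the database with a self-join, so only a single number comes back per document pair instead of two full token lists. This is a hypothetical sketch against the same `index` table, not the poster's actual solution; COALESCE covers the case where the two documents share no tokens:

# compute the tf-idf dot product entirely in SQL via a self-join
query = """SELECT COALESCE(SUM(a.tfidf_norm * b.tfidf_norm), 0)
           FROM `index` a
           JOIN `index` b ON a.token = b.token
           WHERE a.doc_id = %s AND b.doc_id = %s"""
cursor1.execute(query, (docid[i][0], docid[j][0]))
similarity = cursor1.fetchone()[0]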
