Very fast document similarity

Posted by peyton on Stack Overflow See other posts from Stack Overflow or by peyton
Published on 2010-05-13T18:23:30Z Indexed on 2010/05/13 18:34 UTC
Read the original article Hit count: 199

Filed under:
|
|
|

Hello,

I am trying to determine document similarity between a single document and each of a large number of documents (n ~= 1 million) as quickly as possible. More specifically, the documents I'm comparing are e-mails; they are grouped (i.e., there are folders or tags) and I'd like to determine which group is most appropriate for a new e-mail. Fast performance is critical.

My a priori assumption is that the cosine similarity between term vectors is appropriate for this application; please comment on whether this is a good measure to use or not!

I have already taken into account the following possibilities for speeding up performance:

  1. Pre-normalize all the term vectors

  2. Calculate a term vector for each group (n ~= 10,000) rather than each e-mail (n ~= 1,000,000); this would probably be acceptable for my application, but if you can think of a reason not to do it, let me know!

I have a few questions:

  1. If a new e-mail has a new term never before seen in any of the previous e-mails, does that mean I need to re-compute all of my term vectors? This seems expensive.

  2. Is there some clever way to only consider vectors which are likely to be close to the query document?

  3. Is there some way to be more frugal about the amount of memory I'm using for all these vectors?

Thanks!

© Stack Overflow or respective owner

Related posts about cosine

Related posts about search