How to count term frequency for a set of documents?
Posted
by ManBugra
on Stack Overflow
Published on 2010-05-27T19:08:38Z
Indexed on
2010/05/27
19:11 UTC
I have a Lucene index with the following documents:
doc1 := { caldari, jita, shield, planet }
doc2 := { gallente, dodixie, armor, planet }
doc3 := { amarr, laser, armor, planet }
doc4 := { minmatar, rens, space }
doc5 := { jove, space, secret, planet }
So these 5 documents use 14 different terms:
[ caldari, jita, shield, planet, gallente, dodixie, armor, amarr, laser, minmatar, rens, jove, space, secret ]
The frequency of each term across the whole index (in the same order):
[ 1, 1, 1, 4, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1 ]
Or, for easier reading:
[ caldari:1, jita:1, shield:1, planet:4, gallente:1, dodixie:1,
armor:2, amarr:1, laser:1, minmatar:1, rens:1, jove:1, space:2, secret:1 ]
What I want to know is how to obtain the term frequency vector for a given set of documents.
For example:
Set<Document> docs := [ doc2, doc3 ]
termFrequencies = magicFunction(docs);
System.out.println( termFrequencies );
would result in the output:
[ caldari:0, jita:0, shield:0, planet:2, gallente:1, dodixie:1,
armor:2, amarr:1, laser:1, minmatar:0, rens:0, jove:0, space:0, secret:0 ]
or, with all zeros removed:
[ planet:2, gallente:1, dodixie:1, armor:2, amarr:1, laser:1 ]
Notice that the result vector contains only the term frequencies within the given set of documents, NOT the overall frequencies of the whole index! The term 'planet' occurs 4 times in the whole index, but the source set of documents contains it only 2 times.
A naive implementation would be to iterate over all documents in the docs set, create a map, and count each term.
But I need a solution that also works with a document set size of 100,000 or 500,000.
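To make the naive approach concrete, here is a minimal sketch of what I mean. It assumes each document is already available in memory as a plain set of its terms; `NaiveTermCounter` and `countTerms` are hypothetical names used only for illustration and are not part of Lucene.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class NaiveTermCounter {

    // Counts, for each term, how many documents of the given subset contain it.
    // Each document is modelled as a plain Set<String> of terms (an assumption
    // made for illustration; this is not a Lucene type).
    static Map<String, Integer> countTerms(Collection<Set<String>> docs) {
        // TreeMap keeps the terms sorted so the printed result is deterministic.
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> doc : docs) {
            for (String term : doc) {
                // Insert 1 for a new term, otherwise add 1 to the existing count.
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Set<String> doc2 = Set.of("gallente", "dodixie", "armor", "planet");
        Set<String> doc3 = Set.of("amarr", "laser", "armor", "planet");
        // prints {amarr=1, armor=2, dodixie=1, gallente=1, laser=1, planet=2}
        System.out.println(countTerms(List.of(doc2, doc3)));
    }
}
```

This touches every term of every document in the set, so the work grows linearly with the size of the set, which is exactly what worries me for the large sets mentioned here.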
Is there a feature in Lucene I can use to obtain this term vector? If there is no such feature, what would a data structure look like that one could build at index time to obtain such a term vector easily and quickly?
I'm not much of a Lucene expert, so I'm sorry if the solution is obvious or trivial.
© Stack Overflow or respective owner