I'm doing some research and playing with Apache Mahout 0.6.
My goal is to build a system that names different categories of documents based on user input. The documents are not known in advance, and I also don't know which categories I have while collecting them. But I do know that every document in the model belongs to one of the predefined categories.
For example:
Let's say I've collected N documents that belong to 3 different groups:
Politics
Madonna (pop-star)
Science fiction
I don't know which document belongs to which category, but I know that each of my N documents belongs to one of those categories (e.g. there are no documents about, say, basketball among these N docs).
So, I came up with the following idea:
Apply Mahout clustering to these documents (for example k-means with k=3).
This should divide the N documents into 3 groups, which would serve as my model to learn from. I still don't know which document really belongs to which group, but at least the documents are now clustered by group.
Ask the user to find any document on the web that is about 'Madonna' (I can't show the user any of my N documents; that's a restriction). Then I want to measure the 'similarity' between this document and each of the 3 groups.
I expect the similarity between user_doc and the documents in the Madonna group of the model to be higher than the similarity between user_doc and the documents about politics.
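To make the measurement I have in mind concrete, here is a rough sketch in plain Java. It does not use the Mahout API: the class name, the method names and the vectors are made up for illustration, and I'm assuming the documents have already been turned into term-weight vectors of equal length. It averages the cosine similarity between user_doc and every document in a group.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT part of Mahout: compares one document with a group.
final class GroupSimilarity {

    // Cosine similarity between two term-weight vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an all-zero vector is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Mean cosine similarity between userDoc and every document in one group.
    static double meanSimilarity(double[] userDoc, List<double[]> group) {
        if (group.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (double[] doc : group) {
            sum += cosine(userDoc, doc);
        }
        return sum / group.size();
    }

    public static void main(String[] args) {
        // Made-up term weights over a toy vocabulary [election, song, spaceship].
        List<double[]> politics = Arrays.asList(new double[]{3, 0, 0}, new double[]{2, 1, 0});
        List<double[]> madonna = Arrays.asList(new double[]{0, 4, 0}, new double[]{1, 3, 0});
        double[] userDoc = {0, 5, 1};

        System.out.println("politics: " + meanSimilarity(userDoc, politics));
        System.out.println("madonna:  " + meanSimilarity(userDoc, madonna));
    }
}
```

With these toy numbers the Madonna group scores higher than the politics group, which is the behaviour I'd like to get from the real clusters.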
I've managed to produce the clusters of documents by following the 'Mahout in Action' book.
But I don't understand how I should use Mahout to measure the similarity between a 'ready' cluster of documents and one given document.
I thought about rerunning the clustering with k=3 on N+1 documents with the same centroids (in terms of k-means clustering) and seeing where the new document falls, but maybe there is another way to do that?
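One alternative I can imagine, instead of reclustering, is to keep the 3 centroids fixed and just assign the new document to the closest one. Below is a minimal sketch of that in plain Java, again not the Mahout API: the class, the methods and the centroid values are hypothetical, and I'm assuming the centroids and the new document live in the same vector space.

```java
// Hypothetical helper, NOT part of Mahout: nearest-centroid assignment.
final class NearestCentroid {

    // Squared Euclidean distance between two vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the centroid closest to doc; the centroids themselves are not updated.
    static int nearest(double[] doc, double[][] centroids) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.length; i++) {
            double distance = squaredDistance(doc, centroids[i]);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up centroids over a toy vocabulary [election, song, spaceship].
        double[][] centroids = {
            {2.5, 0.5, 0.0}, // cluster 0
            {0.5, 3.5, 0.0}, // cluster 1
            {0.0, 0.5, 3.0}  // cluster 2
        };
        double[] userDoc = {0, 5, 1};
        System.out.println("nearest cluster index: " + nearest(userDoc, centroids));
    }
}
```

This is the same assignment step k-means performs in each iteration, just without recomputing the centroids afterwards, so the existing model stays untouched.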
Is this possible with Mahout, or is my idea conceptually wrong? (An example in terms of the Mahout API would be really good.)
Thanks a lot and sorry for a long question (couldn't describe it better)
Any help is highly appreciated
P.S. This is not a homework project :)