Hello,
I'm currently developing an application where I want to group similar items. Items (like videos) can be created by users and also their attributes can be altered or extended later (like new tags). Instead of relying on users' preferences as most collaborative filtering mechanisms do, I want to compare item similarity based on the items' attributes (like similar length, similar colors, similar set of tags, etc.). The computation is necessary for two main purposes: Suggesting x similar items for a given item and for clustering into groups of similar items.
My application so far is follows an asynchronous design and I want to decouple this clustering component as far as possible. The creation of new items or the addition of new attributes for an existing item will be advertised by publishing events the component can then consume.
Computations can be provided best-effort and "snapshotted", which means that I'm okay with the best result possible at a given point in time, although result quality will eventually increase.
So I am now searching for appropriate algorithms to compute both similar items and clusters. At important constraint is scalability. Initially the application has to handle a few thousand items, but later million items might be possible as well. Of course, computations will then be executed on additional nodes, but the algorithm itself should scale. It would also be nice if the algorithm supports some kind of incremental mode on partial changes of the data.
My initial thought of comparing each item with each other and storing the numerical similarity sounds a little bit crude. Also, it requires n*(n-1)/2 entries for storing all similarities and any change or new item will eventually cause n similarity computations.
Thanks in advance!
UPDATE tl;dr
To clarify what I want, here is my targeted scenario:
User generate entries (think of documents)
User edit entry meta data (think of tags)
And here is what my system should provide:
List of similar entries to a given item as recommendation
Clusters of similar entries
Both calculations should be based on:
The meta data/attributes of entries (i.e. usage of similar tags)
Thus, the distance of two entries using appropriate metrics
NOT based on user votings, preferences or actions (unlike collaborative filtering). Although users may create entries and change attributes, the computation should only take into account the items and their attributes, and not the users associated with (just like a system where only items and no users exist).
Ideally, the algorithm should support:
permanent changes of attributes of an entry
incrementally compute similar entries/clusters on changes
scale
something better than a simple distance table, if possible (because of the O(n²) space complexity)