Approaches for Content-based Item Recommendations

Posted by PartlyCloudy on Stack Overflow See other posts from Stack Overflow or by PartlyCloudy
Published on 2010-12-28T16:07:53Z Indexed on 2011/01/08 6:53 UTC
Read the original article Hit count: 261

Filed under:

I'm currently developing an application where I want to group similar items. Items (like videos) can be created by users and also their attributes can be altered or extended later (like new tags). Instead of relying on users' preferences as most collaborative filtering mechanisms do, I want to compare item similarity based on the items' attributes (like similar length, similar colors, similar set of tags, etc.). The computation is necessary for two main purposes: Suggesting x similar items for a given item and for clustering into groups of similar items.

My application so far is follows an asynchronous design and I want to decouple this clustering component as far as possible. The creation of new items or the addition of new attributes for an existing item will be advertised by publishing events the component can then consume.

Computations can be provided best-effort and "snapshotted", which means that I'm okay with the best result possible at a given point in time, although result quality will eventually increase.

So I am now searching for appropriate algorithms to compute both similar items and clusters. At important constraint is scalability. Initially the application has to handle a few thousand items, but later million items might be possible as well. Of course, computations will then be executed on additional nodes, but the algorithm itself should scale. It would also be nice if the algorithm supports some kind of incremental mode on partial changes of the data.

My initial thought of comparing each item with each other and storing the numerical similarity sounds a little bit crude. Also, it requires n*(n-1)/2 entries for storing all similarities and any change or new item will eventually cause n similarity computations.

Thanks in advance!

UPDATE tl;dr

To clarify what I want, here is my targeted scenario:

User generate entries (think of documents)
User edit entry meta data (think of tags)

And here is what my system should provide:

List of similar entries to a given item as recommendation
Clusters of similar entries

Both calculations should be based on:

The meta data/attributes of entries (i.e. usage of similar tags)
Thus, the distance of two entries using appropriate metrics
NOT based on user votings, preferences or actions (unlike collaborative filtering). Although users may create entries and change attributes, the computation should only take into account the items and their attributes, and not the users associated with (just like a system where only items and no users exist).

Ideally, the algorithm should support:

permanent changes of attributes of an entry
incrementally compute similar entries/clusters on changes
scale
something better than a simple distance table, if possible (because of the O(n²) space complexity)

Developer IT

Approaches for Content-based Item Recommendations - Developer IT

Approaches for Content-based Item Recommendations

clustering

recommendation

collaborative-filtering

collective-intelligence

Related posts about clustering

agglomerative clustering java

MySQL Clustering in a Sandbox

Clustering for Mere Mortals (Pt2)

Microsoft SQL Server High-Availability Videos and Q&A Log

I need advice about iscsi + zfs(or ntfs) + windows 2008 clustering

Related posts about recommendation

HTML5: Difference between the "proposed recommendation" and the "candidate recommendation"

What's the best C# recommendation engine or framework?

SQL Server 2008 Hardware Recommendation;

International multi-OS keyboard layout for both coding and surfing?

Software for video subscription service

Categories cloud