Collaborative Filtering Program: What to do for a Pearson Score When There Isn't Enough Data
- by Mike
I'm building a recommendation engine using collaborative filtering. For similarity scores, I use a Pearson correlation. This works well most of the time, but sometimes I have users who share only 1 or 2 fields. For example:
User 1{
a: 4
b: 2
}
User 2{
a: 4
b: 3
}
Since this is only 2 data points, a Pearson correlation will always be 1 or -1 (any two points lie on a straight line, so the correlation is perfect), or undefined if one user gave both fields the same rating. Here it comes out as 1. This obviously isn't what I want, so what value should I use instead? I could just throw away all instances like this (give a correlation of 0), but my data is really sparse right now and I don't want to lose anything. Is there any similarity score I could use for these cases that would fit in with the rest of my similarity scores (all Pearson)?
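To make the problem concrete, here is a minimal sketch of roughly what I'm computing. The function name `pearson` and the `min_overlap` parameter are just illustrative names for this post, not from any library, and returning 0 for too-small or zero-variance overlaps is the "throw it away" fallback I mentioned, not something I'm happy with.

```python
from math import sqrt

def pearson(u, v, min_overlap=0):
    """Pearson correlation over the fields two users share.

    Returns 0.0 when fewer than `min_overlap` fields are shared, or when
    either user's shared ratings have zero variance (correlation undefined).
    """
    shared = sorted(set(u) & set(v))
    n = len(shared)
    # At least 2 shared fields are needed for a correlation at all.
    if n < max(min_overlap, 2):
        return 0.0
    xs = [u[k] for k in shared]
    ys = [v[k] for k in shared]
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

user1 = {"a": 4, "b": 2}
user2 = {"a": 4, "b": 3}

score = pearson(user1, user2)                     # perfect correlation from 2 points
discarded = pearson(user1, user2, min_overlap=3)  # the "throw it away" fallback
print(score, discarded)
```

With only two shared fields the score is a perfect correlation no matter how far apart the actual ratings are, and setting `min_overlap=3` just drops the pair entirely, which is the data loss I'm trying to avoid.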