by alanctgardner3 4378 days ago
Another commenter pointed this out, but what you're trying to compute is cosine similarity, and you're missing the normalizing term in the denominator (the product of the magnitudes of the two vectors). Without it, two items that each occur frequently will score higher than two items that occur infrequently but co-occur more often than you'd expect. This leads to a tendency to over-recommend popular items.
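To make the difference concrete, here's a rough sketch in Python. The item vectors are made up for illustration (one entry per user, 1 meaning that user interacted with the item), and the function names are my own, not anything from your code or from Mahout.

```python
import math

def dot(u, v):
    # Raw co-occurrence score: the un-normalized numerator.
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector magnitudes.
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot(u, v) / (norm_u * norm_v)

# Hypothetical data: 10 users, two popular items and two niche items.
popular_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
popular_b = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
niche_a   = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
niche_b   = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

print("raw    popular:", dot(popular_a, popular_b))
print("raw    niche:  ", dot(niche_a, niche_b))
print("cosine popular:", cosine_similarity(popular_a, popular_b))
print("cosine niche:  ", cosine_similarity(niche_a, niche_b))
```

With these numbers the raw score ranks the popular pair above the niche pair, even though the niche items are seen by exactly the same users. Dividing by the magnitudes flips the ranking.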

When you were on EMR, you could have used Mahout's distributed collaborative filtering, which has the benefits of being correct and requiring zero coding.

Wikipedia explains here: http://en.wikipedia.org/wiki/Cosine_similarity

1 comment

Thanks for the tip, I'll look into this more.