by hywel
3967 days ago
The actual rating is a tiny part of the data per movie, so there's not much to save there. Clustering would have to be done instead of indexing by movie / user, so it would probably make performance worse overall: indexing by movie / user is done precisely to use the cache efficiently. Unfortunately, you have to iterate through both movies and users, so you either store the sparse matrix twice (once movie-indexed, once user-indexed) or you deal with lots of cache misses half the time.

And, yes, all the values are stored 0-based for exactly that reason :) It's an even bigger saving for storing timestamps.

Not sure what you mean about the prime solution not scaling well: 3 primes of ~2^20 can be stored in ~2^60 (i.e. within 8 bytes), as opposed to three 4-byte integers (12 bytes). Where it really sucks is when you're storing lots of small integers, e.g. 20 values each in [0,1,2,3]. That gets very inefficient fast, and it'd be much more efficient to use normal bitfields.
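
To make the bitfield point concrete, here's a minimal sketch in Python. It is only an illustration, not code from the project being discussed, and the names `pack_2bit` and `unpack_2bit` are made up for this example. It packs 20 values from [0,1,2,3] at 2 bits each, so they fit in 40 bits, and it also checks the size arithmetic for three factors just under 2^20.

```python
def pack_2bit(values):
    """Pack integers in [0, 3] into a single int, 2 bits per value.

    Hypothetical helper for illustration; value i occupies bits 2*i and 2*i+1.
    """
    packed = 0
    for i, v in enumerate(values):
        if not 0 <= v <= 3:
            raise ValueError(f"value {v} at position {i} does not fit in 2 bits")
        packed |= v << (2 * i)
    return packed


def unpack_2bit(packed, count):
    """Recover `count` 2-bit values from an int produced by pack_2bit."""
    return [(packed >> (2 * i)) & 0b11 for i in range(count)]


# 20 small values, each in [0, 3]
values = [3, 0, 1, 2] * 5
packed = pack_2bit(values)
restored = unpack_2bit(packed, len(values))

# 20 values * 2 bits = 40 bits, i.e. 5 bytes
bits_used = packed.bit_length()

# Size check for the other case: three factors just under 2^20
# multiply to just under 2^60, which fits in a 64-bit word.
product_bits = ((2**20 - 1) ** 3).bit_length()
```

Unpacking is just a shift and a mask per value, so reading one field back costs a couple of cheap integer operations.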