|
|
|
|
|
by stochastic_monk
2513 days ago
|
|
What’s your top-k algorithm? I’ve used count sketches and heaps (I read heavy-keeper strictly outperforms count/count-min). I’ve also been told some variant of a hierarchical count sketch provides composability, but I’m not sure what that is. (Actually, how do you compose? Is it the feature vector a union of all top-k features?) |
|
Most of the work we actually don't use any fancy sketching. Just a GIANT table where we count hashes, and ignore collisions, making it super fast to count. Then we just use Quick-Select to find the top-k counts out of 2^31 buckets.
Second phase we technically use the Space-Saving algorithm, but its generally not needed - we could just use a hash-map and naively count (which we know is actually safe is the data is truly power-law distributed). The Space-Saving algo gives us some greater peace-of-mind though since real data doesn't have to obey our theories :)
Features are just bag-of-words style!