Hacker News new | ask | show | jobs
by stochastic_monk 2513 days ago
What’s your top-k algorithm? I’ve used count sketches and heaps (I read heavy-keeper strictly outperforms count/count-min). I’ve also been told some variant of a hierarchical count sketch provides composability, but I’m not sure what that is.

(Actually, how do you compose? Is it the feature vector a union of all top-k features?)

1 comments

Top-k for which phase?

Most of the work we actually don't use any fancy sketching. Just a GIANT table where we count hashes, and ignore collisions, making it super fast to count. Then we just use Quick-Select to find the top-k counts out of 2^31 buckets.

Second phase we technically use the Space-Saving algorithm, but its generally not needed - we could just use a hash-map and naively count (which we know is actually safe is the data is truly power-law distributed). The Space-Saving algo gives us some greater peace-of-mind though since real data doesn't have to obey our theories :)

Features are just bag-of-words style!