Hacker News new | ask | show | jobs
by senderista 422 days ago
Yeah, obviously I wouldn't bother with a heap for k=3. A heap has good compactness but poor locality, so I guess it wouldn't perform well out of (some level of) cache.
1 comments

So quickselect needs multiple passes, and the heap needs O(n log k) time to find the top k elements of n elements total.

However, you can find the top k elements in O(n) time and O(k) space in a single pass.

One simple way: you keep a buffer of up to 2*k elements. You scan your stream of n items one by one. Whenever your buffer gets full, you pare it back down to k elements with your favourite selection algorithm (like quickselect).

As a minor optimisation, you can only add items to your buffer, if they improve on the worst element in your buffer (or when you haven't hit k elements in your buffer, yet).

As an empirical question, you can also experiment with the size of the buffer. Theoretically any multiple of k will do (even 1.1*k or so), but in practice they give you different constant factors for space and time.

How do you efficiently track the "worst element" without something like a max-heap? But yeah, this is a fun algorithm. I think I've seen it before but can't place it, do you remember where you came across it?

  if x > worst then worst = x
Exactly. Keeping a max (or min or sum etc) is easy, if you only ever insert into your structure.

The deletion comes in a big batch, where we don't mind paying a linear cost to prune and rebuild our bucket.

Oh, and in our case it's even simpler: the worst element in our buffer only updates during the pruning phase. By construction, we otherwise only ever insert elements that are better than the worst, so we don't have to update the worst. (Outside of the first k elements that we all take anyway to get started. But if you want, you can handle that as a special case.)

Oh duh, you only evict when pruning the bottom half.