Hacker News new | ask | show | jobs
by amluto 422 days ago
To be fair to quickselect, I can imagine a lazy data processing framework having a concept of a lazily sorted data column where the actual data has been materialized but it’s not in sorted order yet. Then someone does “LIMIT k” to it, and the framework can go to town with quickselect.

As noted a couple times in this thread, there are all kinds of tradeoffs here, and I can’t imagine quickselect being even close to competitive for k that is small enough to fit in cache. Quickselect will, in general, scan a large input approximately twice. For k = 3, the answer fits in general-purpose registers or even in a single SIMD register, and a single scan with brute force accumulation of the answer will beat quickselect handily and will also beat any sort of log-time heap.

(In general, more advanced and asymptotically better algorithms often lose to simpler brute force algorithms when the parameters in question are smallish.)

1 comments

Yeah, obviously I wouldn't bother with a heap for k=3. A heap has good compactness but poor locality, so I guess it wouldn't perform well out of (some level of) cache.
So quickselect needs multiple passes, and the heap needs O(n log k) time to find the top k elements of n elements total.

However, you can find the top k elements in O(n) time and O(k) space in a single pass.

One simple way: you keep a buffer of up to 2*k elements. You scan your stream of n items one by one. Whenever your buffer gets full, you pare it back down to k elements with your favourite selection algorithm (like quickselect).

As a minor optimisation, you can only add items to your buffer, if they improve on the worst element in your buffer (or when you haven't hit k elements in your buffer, yet).

As an empirical question, you can also experiment with the size of the buffer. Theoretically any multiple of k will do (even 1.1*k or so), but in practice they give you different constant factors for space and time.

How do you efficiently track the "worst element" without something like a max-heap? But yeah, this is a fun algorithm. I think I've seen it before but can't place it, do you remember where you came across it?

  if x > worst then worst = x
Exactly. Keeping a max (or min or sum etc) is easy, if you only ever insert into your structure.

The deletion comes in a big batch, where we don't mind paying a linear cost to prune and rebuild our bucket.

Oh, and in our case it's even simpler: the worst element in our buffer only updates during the pruning phase. By construction, we otherwise only ever insert elements that are better than the worst, so we don't have to update the worst. (Outside of the first k elements that we all take anyway to get started. But if you want, you can handle that as a special case.)

Oh duh, you only evict when pruning the bottom half.