Hacker News new | ask | show | jobs
by amluto 423 days ago
> I guess sorting 150 million integers in 70ms shouldn't be surprising.

I find sorting 150M integers at all to be surprising. The query asks for finding the top 3 elements and returning those elements, sorted. This can be done trivially by keeping the best three found so far and scanning the list. This should operate at nearly the speed of memory and use effectively zero additional storage. I don’t know whether Clickhouse does this optimization, but I didn’t see it mentioned.

Generically, one can find the kth best of n elements in time O(n):

https://en.m.wikipedia.org/wiki/Selection_algorithm

And one can scan again to find the top k, plus some extra if the kth best wasn’t unique, but that issue is manageable and, I think, adds at most a factor of 2 overhead if one is careful (collect up to k elements that compare equal to the kth best and collect up to k that are better than it). Total complexity is O(n) if you don’t need the result sorted or O(n + k log k) if you do.

If you’re not allowed to mutate the input (which probably applies to Clickhouse-style massive streaming reads), you can collect the top k in a separate data structure, and straightforward implementations are O(n log k). I wouldn’t be surprised if using a fancy heap or taking advantage of the data being integers with smallish numbers of bits does better, but I haven’t tried to find a solution or disprove the existence of one.

4 comments

I am the author of the optimization of partial sorting and selection in Clickhouse. It uses Floyd-Rivest algorithm and we tried a lot of different things back at the time, read [1]

Overall clickhouse reads blocks of fixed sizes (64k) and finds top elements and then does top of the top until it converges.

[1] https://danlark.org/2020/11/11/miniselect-practical-and-gene...

> This can be done trivially by keeping the best three found so far and scanning the list.

That doesnt seem to guarantee correctness. If you dont track all of the unique values, at least, you could be throwing away one of the most common values.

The wiki entry seems to be specifically about the smallest, rather than largest values.

The max-heap algorithm alluded to above is correct. You fill it with the first k values scanned, then peek at the max element for each subsequent value. If the current value is smaller than the max element, you evict the max element and insert the new element. This streaming top-k algorithm is ubiquitous in both leetcode interviews and applications. (The standard quickselect top-k algorithm is not useful in the streaming context because it requires random access and in-place mutation.)
To be fair to quickselect, I can imagine a lazy data processing framework having a concept of a lazily sorted data column where the actual data has been materialized but it’s not in sorted order yet. Then someone does “LIMIT k” to it, and the framework can go to town with quickselect.

As noted a couple times in this thread, there are all kinds of tradeoffs here, and I can’t imagine quickselect being even close to competitive for k that is small enough to fit in cache. Quickselect will, in general, scan a large input approximately twice. For k = 3, the answer fits in general-purpose registers or even in a single SIMD register, and a single scan with brute force accumulation of the answer will beat quickselect handily and will also beat any sort of log-time heap.

(In general, more advanced and asymptotically better algorithms often lose to simpler brute force algorithms when the parameters in question are smallish.)

Yeah, obviously I wouldn't bother with a heap for k=3. A heap has good compactness but poor locality, so I guess it wouldn't perform well out of (some level of) cache.
So quickselect needs multiple passes, and the heap needs O(n log k) time to find the top k elements of n elements total.

However, you can find the top k elements in O(n) time and O(k) space in a single pass.

One simple way: you keep a buffer of up to 2*k elements. You scan your stream of n items one by one. Whenever your buffer gets full, you pare it back down to k elements with your favourite selection algorithm (like quickselect).

As a minor optimisation, you can only add items to your buffer, if they improve on the worst element in your buffer (or when you haven't hit k elements in your buffer, yet).

As an empirical question, you can also experiment with the size of the buffer. Theoretically any multiple of k will do (even 1.1*k or so), but in practice they give you different constant factors for space and time.

How do you efficiently track the "worst element" without something like a max-heap? But yeah, this is a fun algorithm. I think I've seen it before but can't place it, do you remember where you came across it?
My failure was misreading it as most common k rather than max k.
Most common k is super-interesting because it can't be solved in one pass in constant space!

https://en.wikipedia.org/wiki/Streaming_algorithm#Frequent_e...

What you are quoting solves a very different problem. It doesn't give you the most common k (in general).
Why is that interesting? Intuitively a worst-case could be a stream of n-1 unique elements out of n with the duplicate at the end, so there is no way around O(n) space. Any element could be the most common so you must keep them all.
Sure, a similar trivial argument applies to the linear-space lower bound for set membership. But these linear lower bounds motivate the search for approximate techniques with sublinear lower bounds (although bloom filters or fingerprint tables are not actually sublinear).
The algorithms on the Wikipedia page quoted actually solve a different problem. And they can do that in constant space.

So if someone tells you that one item in the stream is repeated so often that it occurs at least p% of the time (say 10%), then these algorithms can find such an element. But eg if they are multiple elements that occur more than p% of the time, they are not guaranteed to give you the one that occurs the most often. Nor are they guaranteed to give you any meaningful output, if the assumption is violated and no element occurs at least p% of the time.

With an equality that returns true/false, this guarantees correctness. If there can be 3 best/biggest/smallest values, this technique works.
What? The algorithm is completely symmetrical with respect to smallest or largest, and fully correct and general. I don't understand the problem with unique values. Could you provide a minimal input demonstrating the issue?
I cant because I completely misread the wiki article before commenting and have now read it more carefully and realized I was wrong. Specifically I went in thinking about top 3 most common value.
With non-mutable “streaming” input, there is an O(n) algorithm to obtain the unsorted top k with only O(k) extra memory.

The basic idea is to maintain a buffer of size 2k, run mutable unsorted top k on that, drop the smaller half (i.e the lowest k elements), then stream in the next k elements from the main list. Each iteration takes O(k), but you’re processing k elements at a time, so overall runtime is O(n).

When you’re done, you can of course sort for an additional k*log(k) cost.

Maybe they do have that optimization and that explains the 3.59 MiB peak memory usage for ~600MB of integers.