Hacker News new | ask | show | jobs
by asafira 3392 days ago
It is only linear in std::distance(first, last), so it very well might be (and likely is) nlogk .

Edit: see comment below for why it actually is O(n) and not O(n logk) as I had thought.

1 comments

But std::distance(first, last) is how n is defined.

std::nth_element is typically implemented using a fancy version of quickselect called introselect. You can imagine quickselect as a version of quicksort where the recursive call is only executed on the side which contains the element. Introselect adds some fanciness to ensure worst-case linear running time if quickselect takes too long (bad pivot selection).

very cool! I stand corrected, I didn't think of this when I considered it. Sorry for the mix-up
Its O(n logk) then? It maintain a heap I'm guessing?
No, as per my previous comment, it t does not use a heap. It's like quicksort, but only doing one of the recursive calls:

- choose a pivot p randomly from the elements

- partition into elements <= p and > p. This takes time O(n).

- if the number of elements <= p is at most k (the rank of the element we want), do a recursive call on these elements. Otherwise recurse on the other elements (those >p).

Because a random pivot splits the elements well enough most of the time, this has an expected running time of O(n): 1 + ½ + ¼ + ⅛ + … < 2 in the best case, and similar constants if the partitioning is a little worse. But in the worst case, when the pivot is the smallest/largest element in each iteration, it's O(n²). That's where the fanciness of introselect comes in: if quickselect doesn't converge, it switches to an algorithm with worst-case O(n) time, which is slower in most cases but saves the day when quickselect has trouble.

See "median of medians algorithm": https://en.wikipedia.org/wiki/Median_of_medians

It's hairier than a partial quicksort, but (if you erase the proof and weigh the chalk dust) guaranteed linear time.

and median of medians can be used to implement quicksort with O(NlogN) worst case. It is usually not worth it as other hybrid quicksorts like introsort are faster in practice.

On an unrelated note, I was once asked to prove the upper bound for median of median on an interview...

Thanks, right, the idea to is to only recurse down one side on each pass right? This is the (sub)family of "selection" algorithms?

This is interesting I had not heard of introselect before, I will have to read up on this.

Yup, exactly. It's a fun area that gets even more interesting when you look at distributed systems, where every node only stores a small portion of the overall data and you want to keep communication volume low.
>"It's a fun area that gets even more interesting when you look at distributed systems"

Can you elaborate a little bit, specifically how selection algorithms help reduce "chatter" in distributed systems? I am not familiar with this and this sounds like an interesting context.

I have heard of Merkle Trees for similar but that's obviously hashing and not selection algorithms, or is there some connection I am not making?