The k'th largest element in an array can also be found by the nth_element() algorithm already included in the STL. It has linear complexity as opposed to the O(n log k) and O(k log n) variants proposed here.
But std::distance(first, last) is how n is defined.
std::nth_element is typically implemented using a fancy version of quickselect called introselect. You can imagine quickselect as a version of quicksort where the recursive call is only executed on the side which contains the element. Introselect adds some fanciness to ensure worst-case linear running time if quickselect takes too long (bad pivot selection).
No, as per my previous comment, it t does not use a heap. It's like quicksort, but only doing one of the recursive calls:
- choose a pivot p randomly from the elements
- partition into elements <= p and > p. This takes time O(n).
- if the number of elements <= p is at most k (the rank of the element we want), do a recursive call on these elements. Otherwise recurse on the other elements (those >p).
Because a random pivot splits the elements well enough most of the time, this has an expected running time of O(n): 1 + ½ + ¼ + ⅛ + … < 2 in the best case, and similar constants if the partitioning is a little worse. But in the worst case, when the pivot is the smallest/largest element in each iteration, it's O(n²). That's where the fanciness of introselect comes in: if quickselect doesn't converge, it switches to an algorithm with worst-case O(n) time, which is slower in most cases but saves the day when quickselect has trouble.
and median of medians can be used to implement quicksort with O(NlogN) worst case. It is usually not worth it as other hybrid quicksorts like introsort are faster in practice.
On an unrelated note, I was once asked to prove the upper bound for median of median on an interview...
Yup, exactly. It's a fun area that gets even more interesting when you look at distributed systems, where every node only stores a small portion of the overall data and you want to keep communication volume low.
Edit: see comment below for why it actually is O(n) and not O(n logk) as I had thought.