Hacker News new | ask | show | jobs
by cperciva 5604 days ago
The asymptotic optimality of samplesort is in the sense that it takes (1 + o(1)) N log N / log 2 comparisons, i.e., you can't beat it by any constant factor.

Bentley's comments from that email don't make any sense to me. I can't understand what he was thinking, unless he's just unusually prone to flattering other people's work (I don't know Bentley personally, but I've met other people who do this, much to my irritation).

1 comments

> The asymptotic optimality of samplesort is in the sense that it takes (1 + o(1)) N log N / log 2 comparisons, i.e., you can't beat it by any constant factor.

Why isn't it used everywhere that people currently use Quicksort? I've never actually heard of anyone using Samplesort except in a parallel context.

Two reasons: 1. It's a PITA to implement; 2. You need a large N to overcome the o(1), and at that point you're usually dominated by the performance of your memory hierarchy rather than by the cost of doing compares.

Last I heard, python's built-in sort was a type of samplesort, though, so it isn't completely unused.

The previous Python sort was a type of samplesort, but the current one is a “adaptive, stable, natural mergesort” by Tim Peters (“timsort”). On random data I think it performs about the same or slightly worse, but when data has some structure, it performs much better. See http://bugs.python.org/file4451/timsort.txt
Your #2 suggests that it would be slower than ordinary Quicksort on normal-sized datasets. But dual-pivot quicksort is faster than ordinary Quicksort on normal-sized datasets. So in at least one sense, dual-pivot is not a step in the direction of samplesort.

Python's built-in sort is timsort, last I heard, which is a variant of mergesort that looks for existing sorted or backwards runs in order to run faster in the common cases of mostly-sorted or mostly-backwards data. It's true that Python's built-in sort was a samplesort variant until 2002, though, which I didn't know.

(It does seem like Samplesort's virtue of minimizing comparisons would be particularly desirable in an environment like Python, where every comparison potentially involves a call into slow interpreted code.)

I just read the original paper on samplesort, so this might not be the most up to date description, but the basic idea is as follows:

- From an input sequence of length n, choose a random subsequence of length k.

- Sort the subsequence.

- Partition the remaining (n-k) elements of the original sequence into the (k+1) partitions given by the subsequence.

- Sort the (k+1) partitions.

In the paper, the partitioning and sorting is always done using quicksort. Any n*log(n) sorting algorithm should work though, which includes using samplesort recursively.

If you use an optimal value for k (and this might be a significant fraction of n), you can prove that lim_{n->infty} E(C_n)/log(n!) = 1, where E(C_n) is the expected number of comparisons for a sequence of length n.

Now, using samplesort recursively, quicksort is samplesort with k = 1. Double-Pivot quicksort is samplesort with k = 2.