Hacker News new | ask | show | jobs
by Someone 700 days ago
FTA:

“Proof of Average O(n)

On average, the pivot will split the list into 2 approximately equal-sized pieces. Therefore, each subsequent recursion operates on 1⁄2 the data of the previous step.”

That “therefore” doesn’t follow, so this is more an intuition than a proof. The problem with it is that the medium is more likely to end up in the larger of the two pieces, so you more likely have to recurse on the larger part than on the smaller part.

What saves you is that O(n) doesn’t say anything about constants.

Also, I would think you can improve things a bit for real world data by, on subsequent iterations, using the average of the set as pivot (You can compute that for both pieces on the fly while doing the splitting. The average may not be in the set of items, but that doesn’t matter for this algorithm). Is that true?

1 comments

If I'm understanding correctly, the median is actually guaranteed to be in the larger of the two pieces of the array after partitioning. That means on average you'd only discard 25% of the array after each partition. Your selected pivot is either below the median, above the median, or exactly the median. If it's below the median, it could be anywhere in the range [p0, p50) for an average of around p25; if it's above the median, it could be anywhere in the range (p50, p100] for an average of around p75.

Since these remaining fractions combine multiplicatively, we actually care about the geometric mean of the remaining fraction of the array, which is e^[(integral of ln(x) dx from x=0.5 to x=1) / (1 - 0.5)], or about 73.5%.

Regardless, it forms a geometric series, which should converge to 1/(1-0.735) or about 3.77.

Regarding using the average as the pivot: the question is really what quantile would be equal to the mean for your distribution. Heavily skewed distributions would perform pretty badly. It would perform particularly badly on 0.01*np.arange(1, 100) -- for each partitioning step, the mean would land between the first element and the second element.

> If I'm understanding correctly, the median is actually guaranteed to be in the larger of the two pieces of the array after partitioning.

Only in the first iteration. There’s a good chance it will be in the smaller one in the second iteration, for example.

So, your analysis is a bit too harsh, but probably good enough for a proof that it’s O(n) on average.

> Heavily skewed distributions would perform pretty badly

That’s why I used the weasel worlds “real world data” ;-)

I also thought about mentioning that skew can be computed streaming (see for example https://www.boost.org/doc/libs/1_53_0/doc/html/accumulators/...), but even if you have that, there still are distributions that will perform badly.