| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jkabrg 3281 days ago

[Post-edit] I made several edits to the post below. First, to make an argument. Second, to add paragraphs. [/Post-edit]

Tl;dr version: It seems to me you should either use heapsort or plain quicksort; the latter with the sort of optimisations described in the linked article, but not including the fallback to heapsort.

Long version:

Here's my reasoning for the above:

You're either working with lists that are reasonably likely to trigger the worst case of randomised quicksort, or you're not working with such lists. By likely, I mean the probability is not extremely small.

Consider the case when the worst case is very unlikely: you're so unlikely to have a worst case that you're gaining almost nothing for accounting for it except extra complexity. So you might as well only use quicksort with optimisations that are likely to actually help.

Next is the case that a worst case might actually happen. Again, this is not by chance; it has to be because someone can predict your "random" pivot and screw with your algorithm; in that case, I propose just using heapsort. Why? This might be long, so I apologise. It's because usually when you design something, you design it to a high tolerance; a high tolerance in this case ought to be the worst case of your sorting algorithm. In which case, when designing and testing your system, you'll have to do extra work to tease out the worst case. To avoid doing that, you might as well use an algorithm that takes the same amount of time every time, which I think means heapsort.

1 comments

orlp 3281 days ago

The overhead of including the fallback to heapsort takes a negligible, non-measurable amount of processing time that guarantees a worst case runtime of O(n log n), and to be more precise, a worst case that is 2 - 4 times as slow as the best case.

Your logic also would mean that any sorting function that is publicly facing (which is basically any interface on the internet, like a sorted list of Facebook friends) would need to use heapsort (which is 2-4 times as slow), as otherwise DoS attacks are simply done by constructing worst case inputs.

There are no real disadvantages to the hybrid approach.

link

jkabrg 3281 days ago

Thanks for your reply.

> Your logic also would mean that any sorting function that is publicly facing (which is basically any interface on the internet, like a sorted list of Facebook friends) would need to use heapsort (which is 2-4 times as slow), as otherwise DoS attacks are simply done by constructing worst case inputs.

Why is that a wrong conclusion? It might be, I'm not a dev. But if I found myself caring about that sort of minutiae, I would reach exactly that conclusion.

Reasons:

* the paranoid possibility that enough users can trigger enough DoS attacks that your system can fall over. If this is likely enough, maybe you should design for the 2-4x worst case, and make your testing and provisioning of resources easier.

* a desire for simplicity when predicting performance, which you're losing by going your route because you're adding the possibility of a 2-4x performance drop depending on the content of the list. Ideally, you want the performance to solely be a function of n, where n is the size of your list; not n and the time-varying distribution of evilness over your users.

Finally, adding a fallback doesn't seem free to me, because it might fool you into not addressing the points I just made. That O(n^2) for Quicksort might be a good way to get people to think; your O(n log n) is hiding factors which don't just depend on n.

link

sqeaky 3281 days ago

Presuming your internet based users will enter malicious input is not paranoia, it is the only sane starting place.

link

carussell 3281 days ago

I'm sympathetic because it may not be clear:

pdqsort is a hybrid sort; when it encounters apparently pathological input, it switches from a strategy resembling quicksort to heapsort—it doesn't share quicksort's worst case characteristics.

Your thesis is wrong:

> you might as well use an algorithm that takes the same amount of time every time, which I think means heapsort

Heapsort has a variable runtime. It will selectively reheap. Whether this happens is dependent on the state of the heap and your next element, which means the total number of times you reheap will vary with input.

link

jkabrg 3281 days ago

> pdqsort is a hybrid sort; when it encounters apparently pathological input, it switches from a strategy resembling quicksort to heapsort—it doesn't share quicksort's worst case characteristics.

I understand how hybrid sorts work. I thought that would be clear enough. I guess when you're a stranger on the internet, people don't automatically trust you to know what you're talking about. I imagine this is even truer when you say something counterintuitive or controversial. In spite of learning that, I'm responding chiefly because of that quote.

> Your thesis is wrong: > > Heapsort has a variable runtime

Let's not get hung up on heapsort. My thesis is if you have an algorithm with a gap between the worst case and average case, then you shouldn't use it on the web. That gap might not manifest as gap in the big O -- it could be 4x slower in the worst case than the average case, and that would still be bad.

To see why, try imagining this world of pure evil:

You create a website that uses gapsort. Gapsort is an algorithm that is 4x slower in the worst case than it is in the average case, for a fixed value of n (the size of the input). On top of that, triggering the worst case by accident is hard. You deploy your site for several months, buy servers, and get a userbase. In your performance testing, you only encountered the average case, so you only provisioned hardware for the average case. Now imagine that suddenly all your users turn evil, and start triggering the worst case; this leads to a sudden 4x performance drop. You may have underprovisioned for this.

So my thesis is it looks like having differences from the worst case and average case is great, on average. But in a hostile world, that actually makes things more complicated. When doing empirical performance testing, you'll test for the average. Moreover the gap can be just 4x; this will not manifest as a difference between the big Os of the worst and average cases (as I've said previously).

Changing from quicksort to heapsort on some inputs may manifest as a performance gap between the QS case and fallback case. Maybe that's not such a good idea. In fact, maybe you shouldn't use any quicksort variant, even introsort, in a potentially hostile environment, because of the unavoidable average-to-worst-case gap. In introsort, that gap is merely a different coefficient, but it's still a gap.

I hope that wasn't too repetitive.

[edit] I deleted and reposted this because the edits weren't being applied.

[edit] Done it twice now.

link

bmm6o 3280 days ago

I hear what you're saying, but it seems like this is potentially a problem only in cases where you are devoting significant resources to sorting. Like if sorting takes only 1% of your CPU time, worst-case input will only bump that to 4% - and that's only if every user starts hitting you with worst-case requests. Even if you spend more, it's a question of how much work can the malicious users cause you to do.

link