Hacker News new | ask | show | jobs
by celeritascelery 1384 days ago
I see this common theme among very fast practical algorithms (like timsort or pdqsort) where there is not some secret math or algorithm, rather they are just a bunch of special cases based on heuristics. Often they involve specific knowledge of the hardware instead of treating software as abstract. To me this is the big difference between “computer science” and “computer engineering”.
5 comments

Much of this is just threshold effects for algorithm performance, not special cases per se.

You don't even need to have specific knowledge of the hardware as long as you can identify cross-over points where one algorithm starts significantly outperforming others. An old HPC trick is to write software that thoroughly measures several algorithm strategies on your specific hardware environment and then code-gens an algorithm that has an optimal set of strategies and strategy-switching thresholds. The meta-algorithm is mostly focused on cheaply detecting these thresholds at runtime.

For a lot of such algorithms there are also pretty cheap ways to remove jitter from the measurements (in case you're operating on top of a general purpose OS or something) -- you expect for many algorithms that in the absence of external factors the runtime is convex monotonic as a multivariate function of the input sizes and that external confounders don't speed things up. Under that assumption you can measure the runtime for many different input sizes and de-jitter each data point by pushing down that convexity constraint, even if for each input size you only gather one single noisy data point.

Where possible, I always liked arrays of function pointers for each next power of 2 in the input sizes. A few popcounts and a jump later, and you're executing the right routine. It's not great if you're throwing a general-purpose tool against tiny problems, but it's not a huge amount of overhead either, and it's dead simple to code.

Can you give an example of this convex assumption push down? It doesn’t seem right to me. For instance, due to cache line sizes and other blocking/buffering effects throughout hardware and the OS, I actually would expect “staircase” functions which aren’t convex.

Maybe that’s convex if you smooth enough. But depends on the buffer sizes.

In 2000 came across a binary space partition algo that did this, but used a small scheme interpreter to codegen the c++ BSP code. Would love to find out the library name... have now forgotten it.
I had the same takeaway when I read these papers. It felt like a game of whack-a-mole.

There's a massive performance benefit to doing this at the cost of implementation complexity. I haven't studied the implementations or tried my hand one, but I get the impression that these are tough to implement correctly in a way that takes full advantage of the hardware.

(In that sense, it's awesome that the researchers also did the legwork to implement and maintain a library!)

>I see this common theme among very fast practical algorithms (like timsort or pdqsort) where there is not some secret math or algorithm, rather they are just a bunch of special cases based on heuristics

The core here, which is bitmap storage and the basic optimization are simple, but solid and general. Mathematically it takes less space to store data as positions on a bitmap rather than fully spelt associations, and it takes even less space to seggregate the bitmaps in chunks.

This will hold and be smaller and faster in any computer. So, it's not some case of special case based on heuristics, either related to the specific frequencies or sample characteristics of some particular set of data, or of specific CPU peculiarities or whatever.

More exotic optimizations piled on top, sure. But "compressed bitmaps" themselves as a concept, not.

No this is still comp sci. Comp sci doesn’t mean that practical considerations in algorithm design are ignored. Computer engineering has to do with the actual design and construction of computing machinery. It’s an accredited engineering field and the undergraduate coursework includes electronics, semiconductors, computer architecture and generally a lot of overlap with electrical engineering. I had to take one dumb programming course - all the data structures and algorithms stuff I had to learn elsewhere.
timsort and pdqsort seem to be significantly slower than radix sort for random inputs, but presumably radix sort could also benefit from a pile of pattern-finding heuristics.
timsort and pdqsort (two comparison sort methods) and radix sort (a non-comparative sorting algorithm) have different albeit overlapping domains of applicability.

Given a large number of 32-bit integers, radix sort is indeed significantly faster.

While I would reach for a comparison sort method if I had a large number of arbitrary-length Unicode strings, which I wanted to sort in a case-ignoring order.

Also, I found timsort faster than radix sort when there was a small number (as I recall, <100 or so) of elements.

> timsort and pdqsort (two comparison sort methods) and radix sort (a non-comparative sorting algorithm) have different albeit overlapping domains of applicability.

Most comparators are of the form "compare by this, then if tied, compare by that, then if tied, compare by the other thing" which is pretty well suited to radix sort. You are correct though.

> While I would reach for a comparison sort method if I had a large number of arbitrary-length Unicode strings, which I wanted to sort in a case-ignoring order.

It seems like the radix sort is likely to be a lot faster for this too, mainly due to cache effects. If you have a dataset in mind I'll be happy to give it a shot.

> Also, I found timsort faster than radix sort when there was a small number (as I recall, <100 or so) of elements.

For sure. 100 isn't so far from the threshold where a radix sort should fall back to something else anyway.

> which is pretty well suited to radix sort

Yes, the general approach is to convert the input data into a fixed-length bit-string with the same sort order as the input.

Your example assumes that construction overhead is short. If tie-breaking is rare, and breaking the tie requires an expensive operation, then the trade-off point for radix might be much higher than 100 elements.

The fixed-length requirement works well for small items with relatively equal-length fields. Ragged items, like Wikipedia titles, causes a problem. There is one title which is 253 bytes long. Now, Wikipedia titles are limited to 255 bytes, so radix is certainly directly applicable, but 1) it changes the trade-off point, and 2) reduces cache effects.

Finally, it requires a sort-order-preserving transformation. I mentioned case-insensitive collation of Unicode strings as a well-known difficult problem. I do not believe there is mapping to an order-preserving representation which can be done bitwise. At the very least, it will be difficult to support all of the collation styles that currently exist (eg, French collation is different than Dutch).

This is basically all incorrect. Maybe you are writing about LSB radix sort? I am writing about MSB radix sort. You can get some off-the-shelf MSB radix sort for sorting strings of arbitrary lengths here[0]. While this repository is mainly about parallel string sorts, I've found a version of this MSB radix sort that does not perform a bunch of unnecessary copies and caches more bytes at a time[1] to be competitive with the parallel sorts in the repo.

In general MSB radix sort will have to look at the same parts of the input elements as multi-key quick sort, but one hopes that it gets to make fewer passes over the array. A comparator-based sort would look at about the same parts of the input elements as well, but it would look at them many times more than necessary.

> Finally, it requires a sort-order-preserving transformation. I mentioned case-insensitive collation of Unicode strings as a well-known difficult problem. I do not believe there is mapping to an order-preserving representation which can be done bitwise. At the very least, it will be difficult to support all of the collation styles that currently exist (eg, French collation is different than Dutch).

The requirements are a bit underspecified, but I think these can be solved by unicode normalization + tolower + codepoint-wise comparison, which is probably what you'd do in your comparator for a comparison-based sort as well.

[0]: https://github.com/bingmann/parallel-string-sorting/blob/mas...

[1]: https://github.com/dendibakh/perf-challenge6/blob/Solution_R...