| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by rxin 2383 days ago

Disclaimer: I designed the system that won the 2014 GraySort based on Apache Spark, and the same system was extended by a different team to win the 2016 record in cloud sort.

"Merge sort is the only algorithm that can effectively be parallelized" is not a good explanation for the lack of quick sort usage in the sort benchmark. In a parallel sorting system, you can decouple the merging step from the sorting of partial runs. That is, you'd still want the fastest algorithm for sorting partial runs, and then implement a fast merge operation.

So what's the reason? I can only offer speculations based on my own experience.

Contrary to what the name suggests, the sort benchmark is actually not designed primarily to test the in-memory sorting part of the system. It is designed to stress the entire system. When you are required to sort 100TB, or even PBs+ of data, your software system's ability to optimize for I/O (including reading from/writing to disks, sending data across the network, pipelining) is far more important than the actual sorting itself. Of course, you shouldn't have a very slow sort algorithm either.

The reason I used TimSort was because it was relatively fast, already implemented in Apache Spark, I was really familiar with it, so I could tweak it to do what I needed. The vast majority of the time I spent in the sorting part was to tune memory layout (to make sure there's 0 JVM garbage collection, and good cache locality), and to remove any virtual function calls. Once tuned, the sorting time was a very small fraction of the overall time. Using TimSort (which can merge partially sorted runs very quickly) also allowed me to implement just one highly optimized sorting algorithm, rather than having to implement one algorithm for sorting the partial runs, and another algorithm for merging them.

The sort benchmark also defines the key space (10 byte key) as fixed length. My guess is that the fastest in-memory sorting algorithm for this case would be a highly optimized radix sort, which the 2016 Gray sort record used.