|
|
|
|
|
by nightcracker
2226 days ago
|
|
> I tried that too, but it's slower than Lomuto. That is very strange. I can not reproduce your results in C++, using your code. On my machine (Threadripper 2950x, 64 bit Windows 10, GCC 10.1.0 from MSYS2 MinGW64) my algorithm performs best and your branchless version ends up slower than your branchy one: $ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO lomuto.cpp && a 10000000
min_milliseconds=828.1250
median_milliseconds=890.6250
$ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO_BRANCHY lomuto.cpp && a 10000000
min_milliseconds=671.8750
median_milliseconds=671.8750
$ g++ -std=c++17 -O3 -DNDEBUG -DPDQSORT lomuto.cpp && a 10000000
min_milliseconds=343.7500
median_milliseconds=343.7500
On an Intel university machine (Xeon E5-2667 v2 @ 3.3GHz, 64 bit Ubuntu 16.04 LTS, GCC 5.4.0) I get: $ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO lomuto.cpp && ./a.out 10000000
min_milliseconds=514.8051
median_milliseconds=536.1984
$ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO_BRANCHY lomuto.cpp && ./a.out 10000000
min_milliseconds=688.2471
median_milliseconds=698.9603
$ g++ -std=c++17 -O3 -DNDEBUG -DPDQSORT lomuto.cpp && ./a.out 10000000
min_milliseconds=340.1935
median_milliseconds=344.7432
Maybe AMD or Windows doesn't like your branchless code or perhaps GCC is generating inferior code for AMD/Windows, as at least on Intel/Linux your Lomuto branchless code can beat the branchy code. But pdqsort (which uses branchless Hoare partitioning) consistently beats both.I didn't change the benchmark, I only added pdqsort to yours, my full changes are the simple inclusion of 1 header and 3 lines of code: https://github.com/orlp/lomuto/commit/29381176f49e3588f6882c... In order for any interested readers to reproduce, simply clone https://github.com/orlp/lomuto and run the above commands. |
|