There's some logic to detect if the data is already sorted or reverse sorted, but I think that's only triggered at the very beginning and not at every recursion level. If the data is almost sorted it seems to become a little bit faster. Take a look at the appendix of the paper where there are timings for various distributions.
There's also some logic to handle the case where there are a lot of equal elements which also results in faster performance than the completely random case.
On my office Xeon E5-2690 machine when using multiple threads the runtime decreases like this
840 ms for std::sort
372 ms for IPS4o sequentially
201 ms for 2 threads
104 ms for 4 threads
53 ms for 8 threads
33 ms for 16 threads