Quite. I am puzzled by the article. As even the author points out, binary search is tail-recursive, and therefore trivially transformed to an iterative form. So why on earth profile a recursive implementation?
That was the big WTF for me too. Before you start microbenchmarking and talking about the machine instructions for a function call, get rid of the damn function call.
And if we're talking about the cost in cycles of the operations for jump versus compare, shouldn't we also then consider the cost of the calculation of the pivot points?
I haven't done assembly language in a couple of decades, but it seems to me that the cost of calculating the traditional pivot point will be rather cheaper than that for the dual pivots.
At least back in the day, a division by two was a trivial operation (arithmetic shift right by 1), whereas the division by three would require an actual calculation: not a big deal, but more expensive than the ASR.
But the compiler will( should, look at generated code ) optimize the constant division anyway.
---
Behold, division by three using only addition and shifting( works up to 32767 ):
unsigned int div3upto32767( unsigned int n )
{
return ( ( n << 13 )+( n << 11 )+( n << 9 )+( n << 7 )+( n << 5 )+( n << 3 )+( n << 1 )+n ) >> 15 ;
}
---
This one works up to 32767, and then produces a wrong result every ~32767 numbers or so. The result is of by one. When you get over a million, every number is of by a couple of digits.
uint64_t div3almost( uint64_t n )
{
n -= ( n >> 15 ) ;
return ( ( n << 13 )+( n << 11 )+( n << 9 )+( n << 7 )+( n << 5 )+( n << 3 )+( n << 1 )+n ) >> 15 ;
}