Hacker News new | ask | show | jobs
by linknoid 2004 days ago
Usually when I'm doing a binary search, I'm not searching over integers, so the comparison itself will be quite expensive compared to the rest of the loop, so I prefer it like this:

  int BinarySearch<t>(t[] array, t find, int min, int max, Comparer<t> comparer)
  {
      int mid = 0;
      while (min <= max)
      {
          mid = (int)(((long)min + (long)max) / 2);
          int compared = comparer.Compare(find, array[mid]);
          if (compared > 0)
              min = mid + 1;
          else if (compared < 0)
              max = mid - 1;
          else
              return mid;
      }
      return ~mid;
  }
2 comments

The bug as pointed by the article would be here:

"Specifically, it fails if the sum of low and high is greater than the maximum positive int value (231 - 1). The sum overflows to a negative value, and the value stays negative when divided by two." -https://ai.googleblog.com/2006/06/extra-extra-read-all-about...

This particular overflow is pretty famous as a bug, but it's always annoyed me. If you have a 32 bit address space and use a 32 bit integer, the only way this would reasonably overflow is if you have a single array of sorted bytes consuming half or more of the address space. Surely there must be a better data structure to use when you're keeping track of sorted bytes filling that much memory. For instance an array of 256 integers can contain the exact same information consuming only a tiny fraction of the space.

As soon as your array is storing objects that are 2 bytes or larger, you can't overflow the unsigned integer any more. The same applies for a 64 bit address and 64 bit integers. Never mind that most 64 bit hardware in existence is actually limited to a 48 bit address space, in which case the bug CAN NOT be exercised even with signed integers.

These issues would be a lot less troublesome if hardware offered basic security, like zero-cost poison-on-overflow and bounds-checked pointers, either of which would solve any risks of just letting it be. But right now there's the pragmatic question of not only whether this is a bug you could hit accidentally, but whether it's a bug you could exploit intentionally to corrupt memory, so it's better to just do the technically-correct thing even if it seems pointless.

It's less concerning for C#, which is a managed language, of course.

In Java (I think that is the implementation language,) longs are 64bit and ints are 32bit, so I think this problem doesn’t come up
It did at one point, I recall Sun had that exact bug at some point back around 1.4, I remember it being fixed. Of course the long conversions alleviate the problem but at a performance cost and its an avoidable cast.
No java doesn't use 64bits at all. It uses unsigned right shift. I mentioned it in another post: "int mid = (low + high) >>> 1;"... And for the older timers (14y ago), there used to be a bug[0] about exactly that.

[0] https://bugs.openjdk.java.net/browse/JDK-5045582

This is my go-to implementation, but I make one little tweak. It’s very unlikely that you will overflow a long, but you can you guarantee that you’ll avoid overflow like this:

mid = min + (max - min) / 2;

In x86 (intel notation) you can also:

add rax, rbx

rcr rax, 1

It explicitly uses the carry flag from the addition as the top bit of the right-rotated result.

Not super useful for anything other than averaging two uints but that's x86 for you.

What I missed in my first pass was

   * Forgetting exactly which register was the low index and which was the high index
   * Underflow if array size was zero from setting the last index to (size - 1)
If the unsigned right shift is defined (java): "int mid = (min + max) >>> 1;"

edit: this is a bit longer version, if signed conversion/shift is not available: int mid = (min >> 1) + (max >> 1) + (min & max & 1)

That's my C# implementation, so long is 64 bits and int is 32 bits. 64 bits will never overflow in my lifetime. That's basically a Google scale amount of data.
can you explain to me how that fixes it? Thanks
There is an implicit invariant that 0 <= min, and min <= max, so max - min is between 0 and int_max inclusive, so you avoid overflowing with that calculation. To be sure that you avoid overflowing at the sum step, you need to prove that min + (max - min) / 2 <= max.
what if max is already overflown due to size of array (without even doing the lo + high calculation?) Then min + (max- min)/2 would be min + (smaller negative value) which would violate the min + (max-min)/2 <= max
Assuming the value of max passed in is "sane" (non-negative and within array bounds), it is strictly decreasing and should never overflow.
This makes sure that none of the quantities being calculated will overflow while still returning the same answer.

Where x is min and y is max, the midpoint is:

(x+y)/2 = x+(y-x)/2 x+y = 2x+y-x x+y=x+y