Hacker News new | ask | show | jobs
by heinrichhartman 531 days ago

    int mid =(low + high) / 2;
> Specifically, it fails if the sum of low and high is greater than the maximum positive int value (2^31 - 1).

I would really challenge calling this kind of effects "bug" or "breakage".

It's like calling Newtons law of gravity broken, because it's not accurate at predicting how galaxies move.

Things are engieered and tested for a certain scale.

Knowing which tools to use at which sacle is part of the craft of engineering.

6 comments

Sometimes they’re engineered and tested for a certain scale.

More often they’re engineered and tested for an arbitrary scale. The limits aren’t considered, behavior at the edges isn’t accounted for, and it’s assumed it will be good enough for real world inputs.

The use of `int` tends to be a dead giveaway. There are some cases where it’s clearly correct: where the spec says so (like argv), where you’re starting from a smaller type and it’s impossible for the calculations to overflow in an int (like adding two uint8), that sort of thing. And there are cases where it’s subtly correct, because you know the range of the value is sufficiently limited, either by mechanics or by spec.

But most of the time, int gets chosen because it’s the apparent default and it’s easy to type. No analysis has been done to see if it’s correct or if you want to declare your code to only support inputs in a certain range.

It’s really clear if you’ve written a binary search (or anything else that works on general arrays) in C and you use int as the index type. There’s pretty much no scenario where that makes sense. In theory you could analyze the entire program and prove that over-large arrays are never passed in, and keep doing it to make sure it stays that way, but that’s not realistic. If the programmer actually took one second to think about the appropriate data types, they’d use size_t rather than int.

You can still have this bug with size_t, of course. But it won’t be “this falls apart with arrays over 1G elements on 64-bit systems that can easily handle them.” If you declare that you wrote the obvious midpoint calculation with size_t because you didn’t intend to support byte arrays larger than half the address space, it’s at least plausible.

i write c++, but i had to teach myself and always wondered why others use imprecise types. portability is one possibility, but then you can't know if your datastructure will break for a given input
History and tradition at this point. Bit-sized integers and the other “meaningful” integer types like size_t weren’t added to the languages themselves until C99 and C++11. A lot of us learned those languages before that, and lots of code still exists from that time, or at least code bases that have evolved from that time.

I think it actually comes from the opposite of portability. Access to different kinds of systems wasn’t common then. If you were learning and working on a system where int is 32 bits and pointers are 32 bits, and other possibilities are just vague mentions in whatever books you’re learning from, it’s very easy to get into the habit of thinking that int is the right type for a 32-bit quantity and for something that can hold a pointer.

The lack of explicitly sized ints is actually a pro-portability feature but it prioritizes speed and ease of implementation of arithmetic operations over bitwise operations. The minimum ranges for each type can be used as a guide for average users to write correct and portable arithmetic and carefully-written bitwise operations. But most people would rather not think about the number of bits being variable at all.
Sort of. It was kind of handy when int would be the natural machine size, 16-bit on 16-bit hardware, 32 on 32. But making int be 64-bit leaves a gap, so it’s generally stuck at 32-but even on 64-bit hardware. And then people couldn’t agree on whether long should be 32 or 64 on 64-bit platforms, so now none of the basic integer types will typically give you the natural machine size on both 32 and 64-bit targets. At this point, if you want the “biggest integer that goes fast on this hardware” then your best bet is probably intptr_t or size_t.
There were/are machines where the char size is not 8 bits, and the ints are not sized in powers of 2. These machines are now rare but I think they still exist. This references some historical examples: https://retrocomputing.stackexchange.com/questions/12794/wer...
Oh wow, I didn't know size_t was so recent.
At least for C++, it's older than C++11; a lot of us used for a long time the "C++0x" pseudo-standard (which is mostly the draft of what later became C++11; as the C++0x name indicates, it was originally intended to be finished before 2010), and on most C++ compilers headers and types from C99 were available even when compiling C++ code (excluding MSVC, which really dragged their feet in implementing C99, and which AFAIK to this day still hasn't fully implemented all mandatory C99 features).
I believe it was somewhat older as part of typical C and C++ implementations, but don’t get standardized for a while. A big part of the older C and C++ standards are about unifying and codifying things that implementations were already doing.
I'm not sure what you mean by "imprecise types", but if you mean something like using an `int` for an array index instead of `size_t` or something, I can tell you why I do it. Using `int` lets you use -1 as an easy invalid index, and iterating backwards is a straightforward modification of the normal loop: `for (int i = max; i >= 0; --i)`. That loop fails if using `size_t`, since it is never negative. Actually `size_t` may not even be correct for STL containers, it might be `std::vector::size_type` or something. Also, I don't think I've encountered an array with more than 2 billion items. And some things, like movie data, are usually iterated over using pointers. As you say `int` is easy to type.

Also, for something like half my programming life, a 2+GB array was basically unobtainable.

By precise, I meant more the byte width (uint32_t vs uint64_t etc). The other kinds of types help you track what the purpose of something is, but don't really assist with correctness at the machine level.

In my work, I have a lot of data that is > 2GB, so int32_t vs uint32_t is very meaningful, and usually using a uint32_t is just delaying upgrading to int64_t or uint64_t.

Going in the other direction, a point cloud can usually be represented using 3 uint16_t and that saves a lot of memory vs using uint32_t or uint64_t.

If you want an index that can go negative, then the right type is ssize_t, not int.
> certain scale

Make it explicit. If the array is too large, throw an IllegalArgumentException. Document the limit. Then I agree with you.

Otherwise, if an allowed input crashes the program at runtime with a random exception, I respectfully disagree.

IllegalArgumentException is OK in your book, but OverflowException is not? It's very rare that I actually care which exception type is thrown, but it's a little more polite to throw more clear and reliable exception types. But saying "this could be a little more polite and therefore YOUR CODE HAS A BUG" would make me never want to work with you again.
Explict OverflowException could be ok but it may not tell me the root cause. A random index error is not ok at any time.
Then you should absolutely stay away from C :)
Words to live by!
I do.

But the example was Java.

The problem is that it's not that much harder to make it work for all the valid inputs.

Not doing that is not good enough.

Another example is summing lots of floats naively instead of using Kahan's algorithm.

It's like we had a theory of gravity that doesn't work on Mars because we have unnecessary division by (m-Mars' mass) in our laws :) It wouldn't be good physics.

It's not much harder in this toy example. In real examples what this looks like is a ton of bikeshedding and arguing about minutiae instead of improving the system in ways that are orders of magnitude more valuable to everyone. The truth is it's far easier to waste time on this mostly meaningless crap, exchanging one exception for another, than it is to think deeply about the problem you're actually trying to solve. It's lazy.
In real world you're not implementing binary search but using one implemented already.

Hopefully implemented by someone who cared enough to avoid bugs or at least document the range of the arguments.

In the real world I don't have an array with over 2^30 elements 99.75% of the time. If I did need to process an array with over 2^30 elements, I'd have dozens of reasons to worry that standard techniques might start failing, and would therefore carefully consider and test everything to a much higher degree. I'd want a safety margin of at least 10x, which means I'm designing for 2^33+ and already know I need to avoid int32 everywhere I can. The type signature of a binary sort returning int32 in such a case should already be triggering alarm bells.

This is a bit like arguing that every building humans have built is flawed because it wouldn't withstand a significant meteor strike. That's not how we do it. We consider maximum likely use cases and then build in safety margins of at least 10^2, sometimes up to 10^6. If you think there's a possibility you're dealing with hundreds of millions of elements, you stop using int32 entirely. You don't wait for 100 million to turn into 4 billion and crash.

Is it worth making the algorithm slower just to have it work on extreme edge cases?
> Is it worth making the algorithm slower just to have it work on extreme edge cases?

Yes! If an API has no documented (and preferably enforced) limits on its arguments, the caller has every right to assume the full range of the argument types is supported.

Else it is very much a bug.

“Be reasonable, dude” is ok when you order 1,000 beers. It’s not an appropriate response when calling a binary search API.

Is it worth to make the algorithm faster at the cost of throwing surprise OutOfBounds exceptions in some extreme edge cases?

Maybe, but only if you - and only you,and not an attacker can control the case you are in.

If an attacker can somehow make sure that there is an array with 2³⁰ elements, you have worse problems than a binary search crashing.
Why do you think this algorithm only applies to arrays? Why do you think this algorithm doesn't apply to this sine lookup table that the compiler placed at the end of the memory in the microcontroller?
Because the article is about binary searching an array. Obviously algorithms must be adapted to what they are used for.
It's as simple as replacing one NULL character with something else. Or 1 length field.
If that happens, you’re going to get an out-of-bounds error no matter how you implement your binary search.
Well it's either that or document the domain.
Nice example with the 1/(m-m_mars)!
+1 for mentioning Kahan.
> Certain scale

Or just put the algorithm in a 16-bit microcontroller, put some table that needs to be looked up (think precomputed sine table), put that table near the end of the memory range, and just make the mistake to call binary search specifying the start and end memory positions of the table.

These implementations are definitely broken when the specification goes like "you just pass in your array here and it will perform binary search on it for your value." Yes, you could constrain this spec, but come on...

It's such a blatant example for a bug, that I struggle to believe how can anyone even remotely conceive and pivot to the idea that "nuh-uh, this is a bad spec not a bad implementation!".

I would disagree. The inputs to these functions are user controlled and so can be forced to break, whereas humans cannot change how gravity works.
But in what realistic scenario would a user be able to put in ≥2^31 while not being able to put in 2^32-1 (which probably breaks many more things from innocent increments or similar) or 2^100?
And further this all assumes they used int vs. long. It can be “wrong” in that it only works for arrays of under 2^63 elements without that ever being a possibility.

Production code is often filled with edge case bugs that simply never come up. Works for 100x the expected use case is generally good enough if you’re approaching the limits where 2^31 is a potential issue then you are also approaching the case where 2^32 definitely will be.

Hacking, of course. Overflows are one of the primary ways that hackers gain control of systems.
That’s irrelevant for majority of software.
This mindset is why hackers are able to exploit most systems.
Which isn’t a problem as exploiting most software is meaningless. Wow someone can hack software they already have root access to the machine for whatever will we do.

It’s management and developers not treating software where it is meaningful for someone to hack as a different category that’s an actual problem.

2^32-1 is almost always a possibility when 2^32 is, but there are many cases where those are possible but 2^100 is not. Basically anything where the value is a count of something rather than raw input fits the bill. How many characters, lines, or files do you support? 2^32 is a totally feasible number in many contexts. 2^100 is physically impossible, there isn’t enough matter.
If you accept 2^32, then code using 32-bit ints is definitely broken on it and thus the OP question of the issue on half that is irrelevant. Which is my point - widening the acceptable input range from 2^31 to 2^32 (or in the case of signed integers, from 2^30 to 2^31; give or take 1 of course) just "fixes" one small case of the actual core issue of nearly any arithmetic anywhere being wrong if you don't explicitly constrain input sizes.
I agree on there not being much difference between 2^30/31/32. But it’s not “nearly any arithmetic.” If your size is an actual data size, then 2^64 is fine.
Right, with 64-bit ints things are a lot nicer. Though you can still run into some issues on "generate this much data" tasks as opposed to "operate over existing data of this size", though perhaps less exploitably so.