
by wat10000 531 days ago

Sometimes they’re engineered and tested for a certain scale.

More often they’re engineered and tested for an arbitrary scale. The limits aren’t considered, behavior at the edges isn’t accounted for, and it’s assumed it will be good enough for real world inputs.

The use of `int` tends to be a dead giveaway. There are some cases where it’s clearly correct: where the spec says so (like argv), where you’re starting from a smaller type and it’s impossible for the calculations to overflow in an int (like adding two uint8), that sort of thing. And there are cases where it’s subtly correct, because you know the range of the value is sufficiently limited, either by mechanics or by spec.
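A minimal sketch of the "adding two uint8" case, assuming the standard `uint8_t` from `<stdint.h>`; the function names are made up for illustration. Both operands are promoted to `int` before the addition, and the largest possible result (510) fits in even the minimum `int` range guaranteed by C.

```c
#include <stdint.h>

/* Hypothetical helper: the sum is computed in int after integer promotion.
   255 + 255 = 510, which fits in any conforming int (at least 16 bits). */
int sum_u8(uint8_t a, uint8_t b)
{
    return a + b;
}

/* Hypothetical helper: the same sum narrowed back to uint8_t.
   The conversion reduces the value modulo 256, so information is lost here,
   not in the int addition itself. */
uint8_t sum_u8_narrow(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}
```

With both inputs at 255, `sum_u8` returns 510 while `sum_u8_narrow` returns 254.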

But most of the time, int gets chosen because it’s the apparent default and it’s easy to type. No analysis has been done to see if it’s correct or if you want to declare your code to only support inputs in a certain range.

It’s really clear if you’ve written a binary search (or anything else that works on general arrays) in C and you use int as the index type. There’s pretty much no scenario where that makes sense. In theory you could analyze the entire program and prove that over-large arrays are never passed in, and keep doing it to make sure it stays that way, but that’s not realistic. If the programmer actually took one second to think about the appropriate data types, they’d use size_t rather than int.

You can still have this bug with size_t, of course. But it won’t be “this falls apart with arrays over 1G elements on 64-bit systems that can easily handle them.” If you declare that you wrote the obvious midpoint calculation with size_t because you didn’t intend to support byte arrays larger than half the address space, it’s at least plausible.
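For illustration, here is a sketch of a binary search that uses `size_t` for indices and also avoids the midpoint overflow. The name `find_sorted` and the "return `n` when absent" convention are assumptions for this example, not something from the thread.

```c
#include <stddef.h>

/* Hypothetical helper: returns the index of key in the sorted array
   a[0..n-1], or n if key is not present. */
size_t find_sorted(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;                  /* half-open range [lo, hi) */
    while (lo < hi) {
        /* (lo + hi) / 2 can wrap when lo + hi exceeds SIZE_MAX.
           lo + (hi - lo) / 2 cannot, because hi - lo <= n. */
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else if (a[mid] > key)
            hi = mid;
        else
            return mid;
    }
    return n;
}
```

The "obvious" `(lo + hi) / 2` with `size_t` only goes wrong once the two indices sum past `SIZE_MAX`, which for byte arrays means more than half the address space; with `int` the same expression overflows just past 2^30 elements.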


tehjoker 531 days ago

i write c++, but i had to teach myself and always wondered why others use imprecise types. portability is one possibility, but then you can't know if your datastructure will break for a given input


wat10000 531 days ago

History and tradition at this point. Bit-sized integers and the other “meaningful” integer types like size_t weren’t added to the languages themselves until C99 and C++11. A lot of us learned those languages before that, and lots of code still exists from that time, or at least code bases that have evolved from that time.

I think it actually comes from the opposite of portability. Access to different kinds of systems wasn’t common then. If you were learning and working on a system where int is 32 bits and pointers are 32 bits, and other possibilities are just vague mentions in whatever books you’re learning from, it’s very easy to get into the habit of thinking that int is the right type for a 32-bit quantity and for something that can hold a pointer.


wakawaka28 530 days ago

The lack of explicitly sized ints is actually a pro-portability feature but it prioritizes speed and ease of implementation of arithmetic operations over bitwise operations. The minimum ranges for each type can be used as a guide for average users to write correct and portable arithmetic and carefully-written bitwise operations. But most people would rather not think about the number of bits being variable at all.


wat10000 530 days ago

Sort of. It was kind of handy when int would be the natural machine size, 16-bit on 16-bit hardware, 32 on 32. But making int be 64-bit leaves a gap, so it’s generally stuck at 32-bit even on 64-bit hardware. And then people couldn’t agree on whether long should be 32 or 64 on 64-bit platforms, so now none of the basic integer types will typically give you the natural machine size on both 32 and 64-bit targets. At this point, if you want the “biggest integer that goes fast on this hardware” then your best bet is probably intptr_t or size_t.
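A quick way to see this on a given target is to print the widths. This is only a sketch; the `BITS` macro and `print_widths` are names invented here, and the output depends on the platform's data model, so no expected output is shown.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical macro: storage width of a type in bits. */
#define BITS(T) (sizeof(T) * CHAR_BIT)

/* Prints the width of the basic integer types next to the
   pointer-sized ones so the differences are visible. */
void print_widths(void)
{
    printf("int      %zu\n", BITS(int));
    printf("long     %zu\n", BITS(long));
    printf("size_t   %zu\n", BITS(size_t));
    printf("intptr_t %zu\n", BITS(intptr_t));
    printf("void *   %zu\n", BITS(void *));
}
```

On common 64-bit Linux and macOS targets (LP64), `int` is 32 bits and `long` is 64; on 64-bit Windows (LLP64), `long` is 32 bits. In both cases `size_t` and `intptr_t` are 64 bits.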


wakawaka28 530 days ago

There were/are machines where the char size is not 8 bits, and the ints are not sized in powers of 2. These machines are now rare but I think they still exist. This references some historical examples: https://retrocomputing.stackexchange.com/questions/12794/wer...


tehjoker 531 days ago

Oh wow, I didn't know size_t was so recent.


cesarb 530 days ago

At least for C++, it's older than C++11; a lot of us used for a long time the "C++0x" pseudo-standard (which is mostly the draft of what later became C++11; as the C++0x name indicates, it was originally intended to be finished before 2010), and on most C++ compilers headers and types from C99 were available even when compiling C++ code (excluding MSVC, which really dragged their feet in implementing C99, and which AFAIK to this day still hasn't fully implemented all mandatory C99 features).


wat10000 531 days ago

I believe it was somewhat older as part of typical C and C++ implementations, but didn’t get standardized for a while. A big part of the older C and C++ standards is about unifying and codifying things that implementations were already doing.


prewett 531 days ago

I'm not sure what you mean by "imprecise types", but if you mean something like using an `int` for an array index instead of `size_t` or something, I can tell you why I do it. Using `int` lets you use -1 as an easy invalid index, and iterating backwards is a straightforward modification of the normal loop: `for (int i = max; i >= 0; --i)`. That loop fails if using `size_t`, since it is never negative. Actually `size_t` may not even be correct for STL containers, it might be `std::vector::size_type` or something. Also, I don't think I've encountered an array with more than 2 billion items. And some things, like movie data, are usually iterated over using pointers. As you say `int` is easy to type.

Also, for something like half my programming life, a 2+GB array was basically unobtainable.
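For reference, a backwards loop can be written with `size_t` without relying on the index going negative. This is a sketch of one common idiom; `find_last` and its "return `n` when absent" convention are hypothetical choices for the example.

```c
#include <stddef.h>

/* Hypothetical helper: index of the last element equal to key,
   or n (one past the end) if there is none. */
size_t find_last(const int *a, size_t n, int key)
{
    /* "i-- > 0" tests the old value and then decrements, so the body
       sees n-1, n-2, ..., 0 and the loop ends without needing i < 0. */
    for (size_t i = n; i-- > 0; ) {
        if (a[i] == key)
            return i;
    }
    return n;
}
```

Returning `n` plays the role that -1 plays with a signed index: it is a value that can never be a valid position in the array.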


tehjoker 530 days ago

By precise, I meant more the byte width (uint32_t vs uint64_t etc). The other kinds of types help you track what the purpose of something is, but don't really assist with correctness at the machine level.

In my work, I have a lot of data that is > 2GB, so int32_t vs uint32_t is very meaningful, and usually using a uint32_t is just delaying upgrading to int64_t or uint64_t.

Going in the other direction, a point cloud can usually be represented using 3 uint16_t and that saves a lot of memory vs using uint32_t or uint64_t.
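A rough sketch of the memory difference, assuming coordinates that have already been quantized to fit in 16 bits. The struct and function names are invented here, and the sizes noted in the comments hold on typical ABIs where these structs carry no padding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical point layouts with three coordinates each. */
typedef struct { uint16_t x, y, z; } Point16;   /* typically 6 bytes  */
typedef struct { uint32_t x, y, z; } Point32;   /* typically 12 bytes */
typedef struct { uint64_t x, y, z; } Point64;   /* typically 24 bytes */

/* Bytes needed to store n points of the 16-bit layout. */
size_t cloud_bytes16(size_t n)
{
    return n * sizeof(Point16);
}

/* Bytes needed to store n points of the 64-bit layout. */
size_t cloud_bytes64(size_t n)
{
    return n * sizeof(Point64);
}
```

Under those assumptions, a cloud of a million points takes about 6 MB with `Point16` versus about 24 MB with `Point64`.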


wat10000 530 days ago

If you want an index that can go negative, then the right type is ssize_t, not int.
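As a sketch of the signed-index approach: `ssize_t` comes from POSIX rather than ISO C, so this example uses `ptrdiff_t` from `<stddef.h>`, which is the standard signed type for array index differences. That substitution, and the helper name `last_negative`, are choices made for this example.

```c
#include <stddef.h>

/* Hypothetical helper: index of the last negative element, or -1 if none.
   ptrdiff_t is signed, so "i >= 0" terminates and -1 works as a sentinel. */
ptrdiff_t last_negative(const int *a, ptrdiff_t n)
{
    for (ptrdiff_t i = n - 1; i >= 0; --i) {
        if (a[i] < 0)
            return i;
    }
    return -1;
}
```

On POSIX systems the same loop reads identically with `ssize_t` (declared in `<sys/types.h>`).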
