Hacker News new | ask | show | jobs
by ycmbntrthrwaway 3461 days ago
The result you get with this trick is signed, while the result you get with sizeof is unsigned.

Edit: Just to clarify, what you get is ptrdiff_t instead of size_t. So if array size is greater than PTRDIFF_MAX, you get undefined behavior [1].

[1] http://en.cppreference.com/w/c/types/ptrdiff_t

3 comments

As far as I know, every compiler is badly broken with arrays greater than SIZE_MAX / 2, so this would be the least of your troubles.
Since I have it at hand, here is a list of examples of how compilers are broken if an array is larger than SIZE_MAX / 2 (which is called PTRDIFF_MAX in the post):

http://trust-in-soft.com/objects-larger-than-ptrdiff_max-byt...

Is there a circle in hell reserved for C standards committee members who add to the number of cases where 'undefined behavior' occurs in the standards?
Just how would you make this case defined? The alternatives to undefined behavior tend to be (1) being silent about the issue and letting users and implementers find out themselves (2) defining the behavior in a way that makes it difficult to implement on some architectures or (3) defining the behavior in a way that imposes costs on all architectures. None of these is particularly attractive.
I would go with option 1 because it does not suggest the problem is solved. As I write this, I realize that there is a counter-argument in that introducing ptrdiff_t gives you a place to issue a warning.
Read "undefined behavior" as "depending on architecture and compiler". It's not like anything can happen, but it's simply not to describe every architecture and every compiler into a standard. Sure, somebody is free to write an implementation where a nuke is launched every time "undefined behavior" is encountered, and they would be right according to C standard, but in real world, you pretty much know what to expect on a given system.
> "depending on architecture and compiler".

The standard also has the idea of "implementation defined" behaviour, which is close to the definition above. "Undefined behaviour" is a trickier beast, since compilers can rightly assume undefined behaviour never occurs, and optimise accordingly.

Not at all - that would be implementation defined behavior.

These days, compilers quite often speculate on undefined behavior, generating code as if the undefined part cannot happen - the result is that your code is going to do stuff you pretty much can not know or expect.

I would really love if undefined behaviour meant defined by architecture, but current compiler writers are firmly in the launch nukes camp.
I hope not. One of the reasons for allowing undefined behaviour is to allow room for compilers to perform certain optimizations that may not be possible if a specific result were required.
How likely do you run into array bigger than 2gb?
The "how likely is it, really?" response to questions of technical correctness has always bothered me. It takes a mindset completely alien to mine to say "Here's a race condition. Sure, it's undefined behavior, but the race is narrow, so it's rare" or to say "Sure, memory allocation can theoretically fail, but in practice almost never does" or to say "fsync is too slow and most computers have batteries these days".

Software is unreliable enough as it is due to problems beneath our notice. It seems reckless to avoid fixing problems that we do notice. Sure, you could argue that rare problems are rare and that users probably won't notice them --- this attitude is penny-wise and pound-foolish, because you can't meaningfully reason about a system that's only probably correct.

Engineering is about tradeoffs. How many once-in-a-thousand bugs do you fix before you tackle the one-in-a-million? Or one in a billion? What about if it takes $10/bug to fix every 1:1000 bug and $100,000 to fix one 1:1000000 bug?

Correctness is great in theory, but in practice it's a matter of what's important.

If you are only looking at probability and cost-to-fix you are overlooking something important - the cost if/when it happens.
This is really emphasized in things like dmfea and other failure mode analysis documents or regulated industry. They want you to document the likelihood, your ability to recover from the failure, as well as the cost o the failure. You can say that you didn't want to pay for someone fixing some unlikely fail mode but that's small consultation to the people whose lives your product is ruining.
The problem you're latching on to I think is how the context for caculating a probability can vary.

If it were really as likely as, say, the sun exploding that X happened then it would be of no use to expend time on X.

BUT very often people speaking about the probability of events given suspicious constraints. While a memory allocation might not fail in most situations it will fail often in some situations. And a one-in-a-million chance is almost guaranteed when there are millions of uses.

Also worth considering that our processors are handling billions of ops per second. One in a million might be happening all the time even for one user.
That's why it's called one in a million...
One in a million happens quite often if you're processing something like ~100k requests a second.
In fact I agreed with the parent and just posted a tongue in cheek remark.

One in a million literally means that at ~100k requests a second it will happen once every 10 seconds.

But it's extremely unlikely when you're processing 10 requests a week, such as might be the case for the web server in a consumer-grade router.
One in a million isn't just a typical statement of probability, it's a colloquialism used to refer to things that never happen in practice. It's highly misleading to use in the context of computers which, due to their natures, have one in a million events occurring constantly.
My comment was tongue in cheek.
>he "how likely is it, really?" response to questions of technical correctness has always bothered me.

But the question is important in another context: language design. Why is this undefined behavior something that exists in the first place? Objects larger than PTRDIFF_MAX could just not be allowed! This avoids the problem and makes code easier to reason about, with pretty much no downside.

I like the way you're thinking, but that sort of thing probably doesn't get past a committee. "Hey we might not be able to think of an application but that doesn't mean our users won't have a legitimate reason for doing it ... Motion passed."
A few months ago I was doing FFTs on arrays larger than 4GB. Amusingly, this uncovered a bug in the LLVM optimizer: It was looking at stride lengths to figure out if accesses were independent, and truncated a 4GB stride down to 0.
I would be extremely interested to hear how you found this bug. Sounds like a difficult bug to track down, and I always learn from good debugging stories.
It was pretty easy to track down: clang38 was exiting with

    Assertion failed: (Distance > 0 && "The distance must be non-zero"),
    function areStridedAccessesIndependent, file /wrkdirs/usr/ports/devel/
    llvm38/work/llvm-3.8.0.src/lib/Analysis/LoopAccessAnalysis.cpp, line 1004.
Looking at the file it was easy to see what was being asserted, and to see that the type was a 32-bit integer; since I knew I was dealing with huge FFTs, the problem was obvious.

Let this be a lesson: Asserting that impossible things don't happen makes debugging much easier when they do happen!

Well good thing it was ReleaseWithAsserts build!
Not likely, but possible. This reminds me of the bug that was found in the binary search algorithm a few years ago, IIRC, in Java. The interesting thing is that binary search is probably one of the earliest-invented algorithms. Yet, in the book Writing Efficient Programs by Jon Bentley (which I mentioned in a recent HN comment), he says that in a class he taught to several industrial programmers with many years of experience, some had bugs in their implementations of binary search that he set them as an exercise. Not sure but I think I remember reading in the article about the Java binary search issue, that even his algorithm had the bug that was found in the Java version. Why it was not found earlier is (maybe) because it only occurred with an extremely large array, IIRC. Don't have a link right now, but it can probably be found by searching for the right phrase.
Just did a google search, and it even partially auto-completed this search for me:

bug in java binary search

and showed a related search in the drop-down, 'programming pearls ...', a book by Jon Bentley, which seems to confirm what I said above (though I saw it in his other book, "Efficient Programs", IIRC - he might have mentioned the same issue in the Programming Pearls book too).

Edit: and the Wikipedia article confirms it too:

https://en.wikipedia.org/wiki/Binary_search_algorithm#Implem...

Yes, that's it, by the desc. and dates shown.
Yes, thanks, that closely describes the assignment I read that he gave.
It's basically bogus to have a single object bigger or equal to half of address space (represented by size_t) in C. 32-bit platforms should detect and abort in such conditions (compiler/linker for static objects, malloc() implementation for dynamic allocations).
Why? If you're running a system with PAE, half of a 32-bit address space is a small fraction of the whole addressable memory.
It's either addressable or it isn't. My understanding of typical PAE systems is that userspace is still limited to 32 bits of address space per process. Any system where userspace is not limited to 32 bits should have a larger than 32-bit size_t. (PAE systems are not true 32-bit platforms.)
Today, with ML, big data and similar applications, that might be often.
Probably not very likely, but keep in mind that this method could also be used without actually allocating the array -- akin to the 'offsetof()' macro. (Which is undefined behavior.)
On a 64-bit platform (anything modern), ptrdiff_t is going to be 64-bit so this will not be an issue (ok, 63-bit... but you get my point.)
In this context, the difference between 63 bits and 64 bits is not trivial.
Both are bigger than the x64 address space...
Often enough; "pack" files in video games are often many GB. Memory-map one of those and there you are . . .
The parent is taking about >2gb on a 32-bit machine.
So what? You can have up to 64GB of RAM on a 32-bit machine: https://en.wikipedia.org/wiki/Physical_Address_Extension
But those have to be mapped a slice at a time into virtual addresses.
Oh surprisingly easily. Say you're handling a few billion cookies in RAM or manipulating DNA data.
It's quite easy to serve over 2 GB of spaces over the network. (gzip, brotli)