Hacker News new | ask | show | jobs
by nlewycky 849 days ago
> -Wub # Warn _anytime_ there is detected potential undefined behavior, irrespective of if there is an associated optimization.

Can I ask you, have you tried any existing tools? Coverity static analysis, Klocwork, PVS-Studio, clang static analysis, tis-interpreter, Frama-C? What did you think of those? If not, why not (how important is the problem to you)?

My understanding of these tools is that they start by marking every spot potential UB could happen -- every add is potentially overflowing, every pointer dereference is potentially null or freed or whatnot, and then they use solvers to prove that the UB does not occur, and print out the rest. The benefit they have is that they can examine more than one file at a time (the compiler may only look at one .c file at a time) and they have permission to take much longer than compiling.

> -Wubelim # Warn any time code is eliminated as a result of undefined behavior / assumptions.

This doesn't happen, the compiler doesn't detect your UB and use that to delete your code. Consider this:

    int x = 4;
    int y;
    int *p = &y - sizeof(int);
    *p = 7;
    printf("%d", x);
The compiler sees 'x' mentioned in two places, once where it's defined, and once where it's used (picture the compiler building up a graph of places a values is set (definitions) and places the value is used, the use-def graph) and replaces the print with "printf("%d", 4);", then since 'x' is dead it can be deleted entirely. The rest of the code with 'y' and 'p' executes exactly in the way the programmer wrote it, we keep 'y' on the stack, and make 'p' a pointer out of bounds by computing the address that is 4 below 'y' and writing sizeof(int) bytes representing the value 7 there. We don't really go out of our way to detect UB.

Another way to think about it is that the assumptions we make about your program being free of UB are completely indistinguishable from all the rest of the correct and working code. "int x = 4;" should declare a new variable, named x, with an int's worth of memory, initialized to the value 4. That is precisely as true to the compiler as any UB-performing, code. When you write "p->xcoord" you are telling the compiler that 'p' is a valid pointer to an object of its type at this moment, and it believes you. Trust the programmer, and all that.

> -Wub... # Any other classes of UB optimizations that change the program as (incorrectly) written.

"UB optimizations" isn't a thing. It just isn't. The optimizations never change the program, at least, not unless the compiler is buggy. The compiler's job is to find some assembly which meets the specification we call the program. With the optimizer enabled, we spend more time so that we can select assembly that minimizes a cost model we have for the execution time on the underlying machine (or sometimes file size).

> Again, the goal is to provide feedback that improves the program and possibly educates / reminds the programmer about how their meanings might be misunderstood.

FWIW we agree on the goal.

The model for warnings in clang at least has been to look at the code as it is typed, and focus on errors that programmers make. We have all kinds of complex rules for warnings, like "if (3 < 4)" issues a warning (-Wtautological-compare) but "if (MAX_THREADS < MAX_CORES)" with #define MAX_THREADS 3 and #define MAX_CORES 4 doesn't. We've put a ton of effort into getting this sort of thing right, and that includes warnings that code will always produce UB when run, even if it was expanded through macros or templates. It's not an exhaustive system, the warnings work was guided by actual bugs we've encountered in real systems.

There might be another way to do this. The C++ constexpr feature has the compiler evaluate some functions at compile time and detect any UB they encounter as they run. The clang implementation of this can also handle working with values that are not known at compile time, and working with dynamic allocations. One could try to run every function with the constexpr evaluator and see whether it does a better job at producing good warnings, then remove the redundant warnings (made by pattern matching on the AST) and see if the result is fast enough to use as part of regular compilation.

2 comments

>int *p = &y - sizeof(int);

Funny how arguing about UB shows that even compiler writers can miss when they are juggling with lots of sharp objects :) There is no way you meant to write integer pointer value - 4 if your intent was 'next to it'. And double the fun is that some future hardware may do sizeof(int) == 1 and have it running 'just fine' :)

As gamedev engine person doing bugs all the time I wish we had some middle ground but a lot of smart people do not want to find it. I guess it is cheaper that way when we just build on the existing C code and hope for the best.

I've forgotten if 'address of' ( & ) is signed or unsigned. Worse, in a quick search online I can't seem to find anyplace that mentions what the return type of address of is, just a bunch of basic working with pointers pages.

The pointer math line should still be legal, even if it would be 'unsafe' in some popular other languages. Useful for determining where to write to memory mapped files or IO. Not so useful in the above case.

The warnings/errors I'd expect would be: *p = 7; // out of known bounds write

> I've forgotten if 'address of' ( & ) is signed or unsigned.

Neither, pointers are their own types which have no sign, only integral types come in signed and unsigned. There exist intptr_t (signed) and uintptr_t (unsigned) which are integral types you can losslessly cast a pointer to. Also the difference of two pointers is a ptrdiff_t which is a signed integral type.

> The warnings/errors I'd expect would be: *p = 7; // out of known bounds write

In that regard I picked a bad example, this code always executes undefined behaviour, so this could indeed be detected by a compiler warning at the cost of doing a simulated execution of the code. Most real problems come about where the code is UB only for certain runtime inputs.

The reason I chose that example was to make a point about how the compiler doesn't realize it's doing anything to "change" the program, and isn't going out of its way to optimize based on undefined behavior. It assumes that a local variable's value can't change with it being assigned to, so how could it know whether to issue the warning?

Another way to put it is:

    void written_in_asm_in_another_file();
    void test() {
        int x = 4;
        written_in_asm_in_another_file();
        printf("%d", x);
    }
The function written in asm might go up the stack to find caller's stack frame and search for the 0x00000004 and change it to a different value. Is the compiler forbidden to replace printf("%d", x); with printf("%d", 4);? Is the compiler allowed to, but required to emit a warning? Are we required to have a 4 on the stack as opposed to keeping it only in registers?

> The pointer math line should still be legal, even if it would be 'unsafe' in some popular other languages.

I'm not sure what you mean by "should", you might be suggesting a change to C or you might be stating what you think the code currently does. Right now in C, the very creation of an invalid pointer is UB whether you use the thing or not. I'm told this is because of very old CPU designs that had distinct pointer and integer registers and loading an invalid address into the pointer register would trap.