Hacker News new | ask | show | jobs
by nanolith 871 days ago
Well, specifically, it counts uninitialized variables as being set to a non-deterministic value. The point of this tool isn't to optimize functions as a compiler would, but rather to find bad behavior based on its machine model.

This isn't perfect, of course. In this case, the compiler can rightly treat x as uninitialized, meaning that its value could be <= 42 at one point, and not be <= 42 at another point. Since the uninitialized variable isn't "pinned", it could technically be different values in different locations.

CBMC's machine model works differently. In this case, x is assigned a non-deterministic value. The branch condition creates a refinement of this value within the scope of that statement. If the branch is taken, then by SMT rules, x can't equal 12345, because it was already refined as being <= 42.

On its own, a model checker can miss situations like these. It's why I recommend -Wall -Werror -Wpedantic in conjunction with CBMC. The compiler should catch this as a warning, and it should be upgraded as an error.

4 comments

> In this case, the compiler can rightly treat x as uninitialized, meaning that its value could be <= 42 at one point, and not be <= 42 at another point. Since the uninitialized variable isn't "pinned", it could technically be different values in different locations.

LLVM has a "freeze" command to stop propagation of undefined values (although I think that command was added later than the first version), so that the value is "pinned" as you say. However, the "undef" command, if you do not use "freeze", will not do this.

I think that the C compiler should "pin" such undefined values where they are used, but I don't know which compilers have an option to do this. (Perhaps CBMC should also have such a switch, so that you can use the same options that you will use with the C compiler.)

With this ex.3 file, the optimizer should be allowed to result in a function that does nothing and has an unspecified return value (which might or might not be the same each time it is executed). (If optimizations are disabled, then it should actually compile the conditional branch instruction.)

This is one reason why I explored model checking machine code output, since at this point, the behavior is very much defined, even if it differs from implied source behavior. But, this gets complicated for other reasons.
This is actually one of the reasons I've been tinkering with CBMC https://www.philipzucker.com/pcode2c/ The idea is to use Ghidra to lift a specialized C interpreter of the machine code and then compare to source or ask other questions via CBMC
This kind of analysis is excellent.

I do recommend standardizing on hardware and toolchain for projects like these, as it can ensure that the machine code is matched.

The third phase of my work will round the corner with machine code analysis. Currently, I'm working on constructive proofs and equivalence proofs of imported C code to handle the pieces that CBMC doesn't do so well, as part of my second phase. So far, I can extract efficient C from Lean 4, but I want to import C directly into Lean 4 to prove it equivalent to extracted code. Hand-written C is easier to read than the line noise I can extract.

In particular, model checking doesn't fare well with data structures, algorithms, and cryptography. These can be torn down quite a bit, and at least we can verify that the algorithms don't leak memory or make bad integer assumptions. But, if we want to verify other properties, this requires constructive proofs.

Between 80 and 95 percent of the time, depending on the type of code being written, CBMC is enough. I'm now solving for that 5 - 20 percent of code that CBMC can't easily tackle.

> model checking doesn't fare well with data structures, algorithms, and cryptography

To the contrary, that's what I'm using it for in most of my projects. It found interesting algorithmic bugs in my ctl find function (3-way comparison callback missing cases), is used to crack poor hashes. Also my tiny-regex matcher profited from it.

Also a lot of baremetal firmware.

ooooooh. link?

What do you mean by standardizing on hardware and toolchain? I am currently choosing ghidra and cbmc, and it seems like the approach is applicable to any compiler or arch that ghidra supports without too much change. I have only been doing x86-64 and gcc so far though

Sadly, I don't have a link for this yet. My goal is to write this up once I have a compelling example.

In the meantime, check out my github. https://github.com/nanolith

Currently, I'm feeding commits into libcparse, which is the start of this effort. That stream is currently about 180 days out of phase with what's in my local repositories, but it is slowly trickling in. The first step of this second phase will be to use libcparse to verify itself via constructive proofs in Lean. Currently, and by that I mean what hits in March or April, libcparse has a mostly C18 compliant lexical scanner for the C preprocessor. The goal is to have a SAX-like C18 parser that can detect all C declarations and definitions. Tooling will include a utility that imports C declarations and definitions into both Lean 4 and Coq. This is a moonlighting effort for an upcoming book I intend to write.

As for standardizing on hardware and toolchain, what I mean is that I try to get projects to commit to using specific hardware and a specific toolchain. The goal is to limit variables for automated testing and analysis.

OpenBSD, in particular, is adamant about testing their OS on real hardware for platforms that they support. Their definition of "working" is on real hardware. I think that's a reasonable goal.

It helps with analysis like this, as you don't have to worry about different compilers or different hardware. GCC, in particular, can produce different output depending on which hardware it was compiled on, unless a specific configuration is locked down. If there is a flaw in how it does vector operations on a particular Xeon processor, this won't be caught if testing locally on a Core i5 processor that produces different code.

Since any use of the variable is undefined behavior, that's what we want to be informed about.

The reasoning about nondeterministic values should be spared for situations when it's other than undefined behavior. For instance, accessing an uninitialized structure member that came from malloc isn't undefined behavior. It's (I think) unspecified behavior. In an implementation that has trap representations for some types, it could hit a trap.

That's understandable, and I would suggest reaching out to the CBMC developers. It shouldn't be difficult to add an uninitialized variable pass to the solver.

Practically speaking, -Wall -Werror should catch this. Any use of a tool like CBMC should be part of a defense in depth strategy for code safety.

It does in clang.

   $ cc -Wall -Werror bar.c
   bar.c:5:11: error: variable 'x' is uninitialized when used here [-Werror,-Wuninitialized]
      if (x <= 42){
          ^
   bar.c:4:12: note: initialize the variable 'x' to silence this warning
      int x;
           ^
            = 0

   $ cc --version
   clang version 16.0.6
It also does in gcc.

   $ gcc -Wall -Werror bar.c
   bar.c: In function 'main':
   bar.c:5:10: error: 'x' is used uninitialized [-Werror=uninitialized]
    5 |       if (x <= 42){
      |          ^
   bar.c:4:11: note: 'x' was declared here
    4 |       int x;
      |           ^
   cc1: all warnings being treated as errors

   $ gcc --version
   gcc 12.2.0
> I would suggest reaching out to the CBMC developers

For something I don't use?

How about this: I would suggest the CMBC developers read HN comments about their stuff, when it comes up.

You're right, notifying them directly would be too much effort.

The sensible way for developers to become aware of possible improvements they could make is for them to constantly scan the rest of the internet.

I do that! Some of my projects are packaged in distros. In the past, I've found discussions of problems among distro people, who didn't contact upstream at all, but just tried to patch things on their own. I fixed things on my end, and in such a way that their patch would fail to apply when they update. No words exchanged at all.

I'm not a user of CMBC. I read about it here, I wrote a comment here, and that's the end of it.

But for a minute I will bite. The submission has no link to the actual project. I found that by a web search. The project has no project trappings: no mailing list, or anything of the sort. The page says, "For questions about CBMC, contact Daniel Kroening". The name is a link to his home page. His home page has no obvious contact information. There are links to papers; probably his e-mail address is given in some of them. Let's click on "A Tool for Checking ANSI-C Programs". Hey, e-mail links in that page, before we even get to the PDF. But ... from 2004. Is k-----g@cs.cmu.edu from two decades ago going to work, given that he's at Oxford?

Suppose I hunt down his current e-mail address. What do I say? "Dear Sir, I feel really awkward to be contacting you about your CBMC program, which I have never actually used, but believe me when I say I have a valuable suggestion ..."

If I used the thing, I'd fix it myself and send a patch!

https://github.com/diffblue/cbmc The github repo is active and they are very responsive to issues. As you say, it is unneccesary and probably undesirable to post an issue for a tool you don't use. The links for this are multiple places in the submission. The state of the cprover website is unfortunate and represents the state of the project unfairly imo.
FWIW I have had luck mailing old research emails, I read my 20 year old university email. It is one of the best signal to noise ratios in communications I have.

I would not mail if it is only about unintialized variables.

People can comment on a project in public discussion forums without then also having a responsibility to become a contributor to the project.
I certainly agree -- but then they should not expect anything to be done about whatever they're complaining about.

I interpreted kazinator's original comment as essentially saying "This looks like something I might use myself, but for this dealbreaker issue."

>Since any use of the variable is undefined behavior

use of the variable as an lvalue is not undefined

When I say that a variable is used, I mean that its value is accessed. This is in line with compiler jargon. In this basic block, the variable x has no next use at the first line; the assignment of 42 is a dead assignment.

  x = 42
  y = w
  x = 43
Search the web for "next-use information".
Recall trying to find some fun bugs caused by uninitialized memory, which was made harder to find due to the fact that the debug runtime zeroed memory allocations while the release runtime did not.
The declaration didn't reserve the address? What else is the point of having a declaration?

I don't think accepting the random initial value means that the value may continue to change until the first time you set a value.

A declaration doesn't reserve an address. It just means that "x is a variable of type M." How this actually works depends on the compiler.

Weird things happen to uninitialized variables during optimization. Basically, nothing can be assumed about the value, because using an uninitialized variable results in undefined behavior. The original un-optimized generated code may very well have some default value, maybe. Probably not. But, the optimizer can decide that this value is represented by register R0 here and R4 there, and since the variable is uninitialized, there's no point in doing a spill or a load. It really comes down to how the optimization passes were written.

I'll take your word for it, but I would say there is a point, which is that the symbol was declared.

And I'm not talking about a default value, this doesn't imply a default value even though assigning an address will come with some value. But I do get that defining the name and type are a seperate thing that don't necessarily require assigning an address any more than assigning an a value.

The difference is, any default value would be arbitrary. 0 is just a value that has a meaning that the compiler can't know. It's almost as arbitrary as 37.

But picking the next available address to attach to the symbol is not like that.

The compiler always picks that itself and the programmer always has to ask what it is when they want to know it. The compiler can safely do that much right at declare time without making any assumptions. The value is still random, which is fine because 0 is almost the same as random, in that it's random if 0 would actually be the best default value for this particular item vs 1, -1, size, -size, size/2, etc.

> But picking the next available address to attach to the symbol is not like that.

> The compiler always picks that itself

One of the best analogies I heard, years ago on IRC, is that good source code is like select cuts of meat and fat, and the compiler is like a meat grinder. What comes out of the meat grinder is hamburger. It bears little resemblance to the source code, syntax, or semantics. But, it is meat.

I'll talk about register machines with a stack, since code is generated on most architecture this way. But, that's not necessarily a thing that can be relied upon. The compiler will preserve "relevant meaning" once the optimization passes are finished. Side-effects and return values (if used by the caller in the case of LTO) may be preserved, but that's largely it. That being said, let's talk about what a variable is, from the perspective of a compiler. A variable is decomposed into a value, which may be referenced, may be changed, or may need to be preserved. The optimizer aggressively strips away anything that isn't immediately needed. If the compiler needs to preserve the value of the register, then it can choose several ways of doing this. If the value is fixed, then it may just become a constant in code. Fixed values can also be produced synthetically. If the value is variable that is set at runtime and referenced later, then it could be set in one register and transferred to another register. If neither is possible, then the value could be spilled to the stack. But, this spill location is not necessarily assigned to this variable. This spill location could be used for another variable later in the function when this one falls out of scope. Spill locations aren't necessarily fixed. On many ABIs, there are scratch registers, registers that must be preserved between calls (the called function must preserve them if used), and registers that can be overwritten after a call. How a value is preserved depends entirely on whether it must be preserved, and which is the most efficient way to do so.

If a variable in source code is referenced but never assigned, then the optimizer can optimize away any "storage" (i.e. stack or a register) for this variable. When the variable is "read", the value used is arbitrary, if at all. The "read" has no meaning and has been optimized away. The value isn't fixed to a random value either at compile time or at runtime. It's whatever flotsam happens to be at that particular place in code at that particular time. If the value is "read" once in one location and once again in another location, even on the next line of source code, these values could be the same or entirely different. There's no telling what the compiler will do, because the optimizer is really free to interpret this in any way. The only way to ensure that a value is preserved by the optimizer is to explicitly assign this variable a value in the source code. This assignment is respected, assuming that the variable is referenced afterward. Beyond that, we can't make any meaningful assumptions.

thank you