by AndyKelley 1203 days ago
A lot of people have the same misunderstanding as you.

LLVM has rules about what is legal and what is not legal. If you follow the rules, you get well-defined behavior. It's the same thing in C. You could compile a safe language to C, and as long as you follow the rules of avoiding UB in C, everything is groovy.

Likewise, this is how Zig and other languages such as Rust use LLVM. They play by the rules, and get rewarded by well-defined behavior.


Isn't one of the LLVM rules that pointers must be valid, with valid provenance, in order to be dereferenced? If 0x2 ends up in a pointer that is dereferenced (or 0x0 in a nonnull pointer), hasn't that rule been broken? And if the rule is broken, doesn't that trigger undefined behavior?

I invite you to share a snippet from the LLVM language reference[1] that backs up your interpretation.

I will return the courtesy, with regard to my interpretation:

> An integer constant other than zero or a pointer value returned from a function not defined within LLVM may be associated with address ranges allocated through mechanisms other than those provided by LLVM. Such ranges shall not overlap with any ranges of addresses allocated by mechanisms provided by LLVM. [2]

[1]: https://llvm.org/docs/LangRef.html

[2]: https://llvm.org/docs/LangRef.html#pointer-aliasing-rules

From the same section,

- Any memory access must be done through a pointer value associated with an address range of the memory access, otherwise the behavior is undefined.

- A null pointer in the default address-space is associated with no address.

A null pointer (0x0) is associated with no address, therefore it has no address range. So if you do attempt a memory access (dereference), the behavior is undefined. QED. A naive translation to assembly would indeed segfault on a modern OS, but LLVM's optimizations are free to assume that code path is unreachable and do anything else.

Once the program is in this state, a bug of some kind is unavoidable. I don't take issue with that - what I take issue with is your claim that this behavior is well-defined, because it definitely is not. It would be equally valid for a null dereference to corrupt your program state or wipe your hard disk.

You have already admitted that 0x1, 0x2, etc. are fine. Your remaining argument rests entirely on the incorrect premise that Zig's only option is to lower to LLVM IR using the default address space.

I don't think 0x2 is a valid pointer either. The docs say the pointer value must be "associated with address ranges allocated through mechanisms..." - to me the word "allocated" means it's the result of an allocation, pointing at usable address space. (Sorry, I know this is a purely semantic argument. Debating the meaning of words does not make for very interesting discussion.)

In Rust, for example, dereferencing a raw pointer is unsafe - because that pointer could have a value of 0x2 - which would result in undefined behavior according to LLVM.
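
To make the Rust point concrete, here is a minimal sketch (the names `read_checked` and `demo` are made up for illustration, not taken from any library). It shows that building a raw pointer with address 0x2 needs no `unsafe` at all; only the dereference does, and the sketch deliberately never dereferences the 0x2 pointer. Note that the null check does not prove a pointer is valid, it only rules out one invalid value.

```rust
/// Hypothetical helper: reads through a raw pointer after a null check.
///
/// # Safety
/// If `p` is non-null, the caller must guarantee that it points at a live,
/// properly aligned `u32`. The null check cannot establish that by itself.
unsafe fn read_checked(p: *const u32) -> Option<u32> {
    if p.is_null() {
        return None;
    }
    // The dereference is the only operation here that requires `unsafe`.
    Some(unsafe { *p })
}

fn demo() -> (Option<u32>, Option<u32>, usize) {
    let x: u32 = 42;
    let valid: *const u32 = &x;
    let null: *const u32 = std::ptr::null();

    // Safe code: creating a pointer whose address is 0x2 is allowed.
    // It is intentionally never dereferenced in this sketch.
    let bogus = 0x2 as *const u32;

    // SAFETY: `valid` points at `x`, which is alive for the whole call;
    // `null` is rejected by the check before any dereference happens.
    let a = unsafe { read_checked(valid) };
    let b = unsafe { read_checked(null) };

    (a, b, bogus as usize)
}

fn main() {
    let (a, b, addr) = demo();
    println!("valid: {:?}, null: {:?}, bogus address: {:#x}", a, b, addr);
}
```

Passing `bogus` to `read_checked` would compile just fine, which is exactly why the function has to be marked `unsafe`: the compiler cannot tell 0x2 apart from a real allocation.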

tbh I'm surprised any of this is even up for debate. If you google "is segfault undefined behavior" you'll get 100 results telling you yes, yes it is.

Are you claiming that any program that segfaults exhibits undefined behavior within LLVM semantics, even those that were not compiled by LLVM? Or within some other set of semantics shared by all programs that can segfault?