by tomsmeding 697 days ago
> A "load with unspecified elements for any values past the end of the allocation, UB only if the hardware doesn't like it" thing shouldn't be hard to support

Not an expert, but to me this sounds like you want an alternative where behaviour for a read beyond the end of an allocation is merely implementation-defined, not undefined. That means the implementation (e.g. LLVM) has to document what they do — which may be platform-dependent — and the choice of whether it becomes undefined is up to the implementation.

The natural thing to do here for the implementation is of course to say "I'm just going to emit the load instruction, it may crash your program, better be prepared".


Here it'd be perfectly fine to define it as "completely arbitrary bits past the end, potentially even differing between back-to-back calls of loading the same memory"; specific backends will end up refining that of course. In LLVM those bytes would behave as freeze(poison).
Not every platform in existence will return data when asked to access stuff out of bounds, even when sufficiently aligned. So you wouldn't want to bake into the standard that valid bits must be returned; you'd want to allow crashing, in the standard. An implementation might then define that for suitably aligned addresses, data will be returned (just not necessarily sensible data).
It should still come with the "UB only if the hardware doesn't like it" caveat, of course. If unusual hardware that doesn't follow conventional memory paging is a concern, providing a "memory_protection_granularity" constant is trivial; it would be used instead of the page size for the check (and such hardware could set it to 1, so the check always fails for multi-byte loads).
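
For concreteness, here is a minimal sketch of the check being described, in C. The constant name MEMORY_PROTECTION_GRANULARITY and the function load_stays_in_granule are made up for illustration (no such constant exists in any standard), and 4096 is just an assumed typical page size. It only does the address arithmetic; it performs no load.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: the smallest unit at which the hardware can
 * make memory inaccessible. Assumed to be a power of two; 4096 is a
 * common page size, not a guarantee. */
#define MEMORY_PROTECTION_GRANULARITY 4096u

/* True if a load of `width` bytes starting at `addr` touches only one
 * protection granule, i.e. its first and last byte fall in the same
 * granule. If the first byte is accessible, such a load is assumed not
 * to fault. Address wraparound is not handled in this sketch. */
static bool load_stays_in_granule(uintptr_t addr, size_t width) {
    uintptr_t mask = ~(uintptr_t)(MEMORY_PROTECTION_GRANULARITY - 1);
    if (width == 0)
        return false;
    return (addr & mask) == ((addr + width - 1) & mask);
}
```

With the granularity set to 1 the mask keeps every address bit, so any load wider than one byte is rejected, which matches the "always failing" case above.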

Alternatively, a different API would be returning an optional of the loaded data, having the stdlib/language/backend convert that to the appropriate boundary check (or always returning a None if impossible).
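
A rough sketch of what that optional-returning shape could look like, again in C. Everything here (maybe_block, try_load_block, GRANULARITY, WIDTH) is a hypothetical name, not an existing API. Portable C cannot express the actual past-the-end load, so this emulation takes the number of valid bytes as a parameter and copies only those; a real backend would presumably emit one wide load and leave the tail unspecified.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GRANULARITY 4096u /* assumed protection granularity (hypothetical) */
#define WIDTH 16u         /* width of the wide load being emulated */

/* Stand-in for an optional: has_value == false plays the role of None. */
typedef struct {
    bool has_value;
    uint8_t bytes[WIDTH];
} maybe_block;

/* Returns a block if a WIDTH-byte load at `p` would stay inside one
 * protection granule, otherwise "None". Only the first `valid` bytes
 * are read; the remaining bytes stand in for the unspecified tail
 * (zero here, arbitrary in a real implementation). */
static maybe_block try_load_block(const uint8_t *p, size_t valid) {
    maybe_block out = { false, {0} };
    uintptr_t a = (uintptr_t)p;
    uintptr_t mask = ~(uintptr_t)(GRANULARITY - 1);
    if (valid == 0 || (a & mask) != ((a + WIDTH - 1) & mask))
        return out; /* load would cross a granule boundary: None */
    memcpy(out.bytes, p, valid < WIDTH ? valid : WIDTH);
    out.has_value = true;
    return out;
}
```

A caller would take the fast path when has_value is set and fall back to a byte-wise loop otherwise; a platform where the check can never succeed would simply always return the empty case.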

Ideally there'd be languages that can at least be configured to provide more useful "unsafe" things, even at the expense of the code not compiling for funky hardware that no one would run the software in question on anyway.

What about tools like ASAN? I want it to be able to tell me if I read a single character out of bounds. Tools like ASAN can't do this if the language gets rid of undefined behavior. The reason why undefined behavior is undefined is because it's such a degenerate state for a program to exist in that any attempt by a language to imbue it with a particular blessed meaning is, to put it politely, crazy; like trying to prove a theorem that's allowed to have some contradictions.
Indeed, if you want an immediate error on every out-of-bounds read, this won't be suitable. I do think one should always have the option to not opt into this. But there still exist use-cases where the benefit of being able to do partially-past-the-end loads would significantly outweigh this downside.

That said, clang's MemorySanitizer, and, similarly, valgrind, could still produce errors via tracking which bytes are undefined within registers; might be somewhat delayed between load and error, but still shouldn't allow such out-of-bound values to be used for much.

And, anyway, as this load would be a separate instruction/builtin (if so decided), UB of regular operations is unaffected. If the sanitizer in question doesn't track (partial) register definedness, it could just accept all of these explicitly-potentially-OoB loads; indeed not ideal, but the alternative is not being able to write such performant code at all.

And there are already people doing this, just limited to doing so with data within a custom allocator. It would just be nice to have a mechanism to not be fully ruled out of using standard tooling at least for testing.

What's wrong with assembly? What's wrong with aligning a pointer and turning the sanitizer off if need be? If you're making machine specific assumptions then you should be programming against the machine rather than the language.
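
For reference, the "align the pointer" approach boils down to arithmetic like the following sketch (C; BLOCK, aligned_window and align_down_window are illustrative names, and a 16-byte block is an assumption). It relies on the block size dividing the page size, so an aligned block never straddles a page. The wide load from the aligned base is still out of bounds at the language level, which is why the sanitizer has to be turned off or the load written in assembly.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 16u /* assumed load width; must be a power of two */

/* Where an aligned block containing `addr` starts, and how many
 * leading bytes of that block come before `addr`. */
typedef struct {
    uintptr_t base;
    size_t skip;
} aligned_window;

/* Round `addr` down to a BLOCK boundary. Only address arithmetic is
 * done here; no memory is read. */
static aligned_window align_down_window(uintptr_t addr) {
    aligned_window w;
    w.base = addr & ~(uintptr_t)(BLOCK - 1);
    w.skip = (size_t)(addr - w.base);
    return w;
}
```

The caller would load BLOCK bytes from base and ignore the first skip bytes as well as anything past the end of the data.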