by dzaima 696 days ago
It would be neat to have non-assembly options for things like this. A "load with unspecified elements for any values past the end of the allocation, UB only if the hardware doesn't like it" thing shouldn't be hard to support, even if just as an alias for the respective assembly invocations.

Additional neatness would be being able to request a guarantee that all allocations - malloc, stack, constants - have at least, say, 64 bytes of non-faulting addresses after them, though that is significantly more complex, requiring cooperation between a bunch of parts.

Annoying thing is that this is trivial with a custom allocator (as long as the compiler isn't told to consider the custom sub-allocations as separate), but then you're stuck not being able to use your SIMD stuff on anything outside your custom heap due to the very tiny chance of segfaulting.

Sanitizers/valgrind don't necessarily become pointless with this even - the past-the-end values are still undefined, can be tracked as such, and error on use.

4 comments

The sanctioned way to do this would be masked aligned load intrinsics: alignment avoids page faults, masking avoids reading undef bits, and being an intrinsic conveys the intent to the compiler, so it knows that this is not an OOB read.

The other option that I've seen discussed is adding a freezing load to LLVM that turns the undef bits into some unspecified but valid bit patterns.

> A "load with unspecified elements for any values past the end of the allocation, UB only if the hardware doesn't like it" thing shouldn't be hard to support

Not an expert, but to me this sounds like you want an alternative where behaviour for a read beyond the end of an allocation is merely implementation-defined, not undefined. That means the implementation (e.g. LLVM) has to document what they do (which may be platform-dependent), and the choice of whether it becomes undefined is up to the implementation.

The natural thing to do here for the implementation is of course to say "I'm just going to emit the load instruction, it may crash your program, better be prepared".

Here it'd be perfectly fine to define it as "completely arbitrary bits past the end, potentially even differing between back-to-back calls of loading the same memory"; specific backends will end up refining that of course. In LLVM those bytes would behave as freeze(poison).
Not every platform in existence will return data when asked to access stuff out of bounds, even when sufficiently aligned. So you wouldn't want to bake into the standard that valid bits must be returned; you'd want to allow crashing, in the standard. An implementation might then define that for suitably aligned addresses, data will be returned (just not necessarily sensible data).
It should still be with "UB only if the hardware doesn't like it", of course. If weird funky hardware not following usual memory paging is of worry, providing a "memory_protection_granularity" constant is trivial, to be used instead of the page size for the check (and said funky hardware could set it to 1, thus always failing).

Alternatively, a different API would be returning an optional of the loaded data, having the stdlib/language/backend convert that to the appropriate boundary check (or always returning a None if impossible).
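To make the granularity check and the optional-returning load above a bit more concrete, here is a rough C sketch. Everything named in it is hypothetical: `MEMORY_PROTECTION_GRANULARITY` stands in for the compiler-provided constant discussed above (4096 is only an assumed value), and `opt_block16`, `stays_in_granule` and `try_load16` are made-up names. It only shows the shape of the API; in ISO C the `memcpy` is still undefined if it runs past the end of the object, which is exactly the gap a real builtin would have to close.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical constant: the smallest unit at which the target can fault.
   4096 is an assumption for typical paged hardware; a target without
   usable paging would set it to 1, so every multi-byte check fails. */
#define MEMORY_PROTECTION_GRANULARITY 4096u

/* Poor man's optional: 16 loaded bytes plus a "present" flag. */
typedef struct {
    bool    present;    /* false: caller must fall back to a bounded copy */
    uint8_t bytes[16];  /* only meaningful when present is true */
} opt_block16;

/* True if [addr, addr + width) stays inside a single granule. */
static bool stays_in_granule(uintptr_t addr, size_t width, size_t granule)
{
    assert(granule != 0 && width != 0);
    return addr % granule + width <= granule;
}

/* Load 16 bytes starting at p if that cannot cross a protection boundary. */
static opt_block16 try_load16(const void *p)
{
    opt_block16 r = { false, {0} };
    if (stays_in_granule((uintptr_t)p, sizeof r.bytes,
                         MEMORY_PROTECTION_GRANULARITY)) {
        /* Assumption: the language or backend blesses this read. Plain
           ISO C does not if it extends past the end of the object. */
        memcpy(r.bytes, p, sizeof r.bytes);
        r.present = true;
    }
    return r;
}
```

A caller would test `present` and take a slower bounded path otherwise; with the granularity set to 1 that slow path is always taken, which matches the "always returning a None" case.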

Ideally there'd be languages that can be at least configured into providing more "unsafe" useful things, even if at the expense of not having the code be compilable targeting funky hardware that no one would run the software in question on anyway.

What about tools like ASAN? I want it to be able to tell me if I read a single character out of bounds. Tools like ASAN can't do this if the language gets rid of undefined behavior. The reason why undefined behavior is undefined is because it's such a degenerate state for a program to exist in that any attempt by a language to imbue it with a particular blessed meaning is, to put it politely, crazy; like trying to prove a theorem that's allowed to have some contradictions.
Indeed, if you want an immediate error on every out-of-bounds read, this won't be suitable. I do think one should always have the option to not opt into this. But there still exist use-cases where the benefit of being able to do partially-past-the-end loads would significantly outweigh this downside.

That said, clang's MemorySanitizer, and, similarly, valgrind, could still produce errors via tracking which bytes are undefined within registers; might be somewhat delayed between load and error, but still shouldn't allow such out-of-bound values to be used for much.

And, anyway, as this load would be a separate instruction/builtin (if so decided), UB of regular operations is unaffected. If the sanitizer in question doesn't track (partial) register definedness, it could just accept all of these explicitly-potentially-OoB loads; indeed not ideal, but the alternative is not being able to write such performant code at all.

And there are already people doing this, just limited to doing so with data within a custom allocator. It would just be nice to have a mechanism to not be fully ruled out of using standard tooling at least for testing.

Simplest solution, and the one I use, is that all SIMD-related buffers use a custom allocator (actually everything uses it), and it always rounds the allocation size up to the SIMD width.
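For illustration, a minimal sketch of such a rounding allocator in C11. The names `SIMD_WIDTH`, `round_up_to_simd` and `simd_alloc` are made up here, and the 32-byte width is an assumption (256-bit vectors); it is a sketch of the idea, not anyone's actual allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SIMD_WIDTH 32u  /* assumed vector width in bytes (256-bit vectors) */

/* Round n up to the next multiple of SIMD_WIDTH. */
static size_t round_up_to_simd(size_t n)
{
    assert(n <= SIZE_MAX - (SIMD_WIDTH - 1));
    return (n + (SIMD_WIDTH - 1)) / SIMD_WIDTH * SIMD_WIDTH;
}

/* Allocate at least n bytes so that a full-width load starting at any
   SIMD_WIDTH-aligned offset below n stays inside the allocation. */
static void *simd_alloc(size_t n)
{
    if (n > SIZE_MAX - (SIMD_WIDTH - 1))
        return NULL;                       /* rounding would overflow */
    size_t padded = round_up_to_simd(n ? n : 1);
    /* C11 aligned_alloc wants the size to be a multiple of the alignment,
       which the rounding already guarantees. */
    void *p = aligned_alloc(SIMD_WIDTH, padded);
    if (p)
        memset(p, 0, padded);              /* tail bytes hold defined values */
    return p;
}
```

The result is released with plain `free`. Zeroing the tail is optional; it just keeps the padding bytes defined for tools that track uninitialized memory.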

Masked loads kinda suck: they are a tiny bit slower, and you now need a mask, which you also have to compute.
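To show what "computing the mask" amounts to, a small C sketch for 16 byte-sized lanes, assuming a bitmask-style masked load (one bit per lane, as with mask registers). `tail_mask16` and `masked_load16_ref` are hypothetical names; the second one is only a scalar reference for the semantics of a zeroing masked load, not a replacement for the real instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit i set means lane i is inside the buffer. Valid for 0 <= n <= 16. */
static uint16_t tail_mask16(size_t n)
{
    assert(n <= 16);
    return (uint16_t)((1u << n) - 1u);
}

/* Scalar reference for a zeroing masked load of 16 byte lanes:
   lanes whose mask bit is clear are never read and come back as 0. */
static void masked_load16_ref(uint8_t out[16], const uint8_t *src,
                              uint16_t mask)
{
    for (size_t i = 0; i < 16; i++)
        out[i] = ((mask >> i) & 1u) ? src[i] : 0;
}
```

For a tail of `n` remaining bytes you would pass `tail_mask16(n)`; the extra shift and subtract per tail is the overhead being complained about.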

This is what I do too (in my case I don't round up the allocation size and just let loads & stores potentially see the next object (doing tail stores via load+blend+store where needed; only works if multithreaded heap mutation isn't required though)).
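The load+blend+store tail store mentioned above looks roughly like this. It is a scalar stand-in written in portable C (real code would use one vector load, one blend and one vector store); `VEC` and `tail_store` are made-up names and the 16-byte width is an assumption. It presumes `dst` points into one big custom heap, so the full-width access is in bounds of that heap even when it spills into the next object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VEC 16u  /* assumed vector width in bytes */

/* Write the first n bytes of vec to dst using only full-width accesses. */
static void tail_store(uint8_t *dst, const uint8_t vec[VEC], size_t n)
{
    uint8_t block[VEC];
    assert(n <= VEC);
    memcpy(block, dst, VEC);          /* load: may include the next object */
    for (size_t i = 0; i < n; i++)    /* blend: new data in lanes below n */
        block[i] = vec[i];
    memcpy(dst, block, VEC);          /* store: neighbour bytes rewritten unchanged */
}
```

The neighbouring object's bytes are read and written back as-is, which is why this breaks down if another thread can modify that object between the load and the store.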

The one case it can be annoying is passing pointers to constant data to custom-heap-assuming functions - e.g. to get a pointer to [n,n-1,n-2,...,2,1,0] for, say, any n≤64, make a global of [64,63,...,2,1,0] and offset its pointer; but you end up needing to add padding to the global, and this materializes as avoidable binary size increase as the "padding" could just be other constants from anywhere else. Copying the constant to the custom heap would be extra startup time and more memory usage (not sharable between processes).
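A sketch of that countdown-table trick in C, with the padding made explicit. `MAX_N`, `PAD`, `countdown_table` and `countdown` are illustrative names, and `PAD` assumes the widest load the consuming functions do is 64 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_N 64
#define PAD   64  /* assumed widest load in bytes */

/* [64, 63, ..., 1, 0] followed by PAD zero bytes of padding. */
static const uint8_t countdown_table[MAX_N + 1 + PAD] = {
    64, 63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49,
    48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33,
    32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17,
    16, 15, 14, 13, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1,
     0
    /* the remaining PAD bytes are zero-initialised */
};

/* Pointer to [n, n-1, ..., 1, 0] for any n <= MAX_N. */
static const uint8_t *countdown(size_t n)
{
    assert(n <= MAX_N);
    return countdown_table + (MAX_N - n);
}
```

The `PAD` bytes are the avoidable binary-size cost: they exist only so that a full-width load starting at `countdown(0)` stays inside the array.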

"UB only if the hardware doesn't like it" sounds like you want to shift the complexity from the developers who know the problem domain best to the packagers.

As soon as the thing is packaged to run on a Raspberry Pi or something else that doesn't like it, it will start to generate CVEs and be a major pain.

This shouldn't ever be a security vulnerability, outside of perhaps denial of service from segfaults (though I'm pretty sure you'd find hardware with no page faults before finding one with pages less than 4KB; and of course, if you wanted to not be hard-coding 4KB, a compiler providing a "minimum page size" constant for the target architecture should be possible, and could return 1 on page-less hardware). But, yes, as with many optimizations, getting them wrong could end up badly.
For the case of specific vector extensions that imply specific cache line sizes, and loads that do not span multiple cache lines, I don't think you could run into issues.