by leiroigh 459 days ago

The main problem with that is that it doesn't play nice with most languages. Consider

  int foo(int* ptr) {
    int x = ptr[1<<16];
    *ptr += 1;
    return x + ptr[1<<16];
  }

Compilers/languages/specs tend to decide that `ptr` and `ptr + (1<<16)` cannot alias, and this can be compiled into e.g.

  foo(int*):
        mov     eax, dword ptr [rdi + 262144]
        inc     dword ptr [rdi]
        add     eax, eax
        ret

which gives undesired results if `ptr` and `ptr + (1<<16)` happen to be mapped to the same physical address. This is also pretty shit to debug/test -- some day, somebody will enable LTO for an easy performance win on release builds, and bad code with a security vuln gets shipped.

2 comments

scottlamb 459 days ago

I don't think that's a fundamental problem. In say Rust (with its famously strict aliasing requirements), you obviously need some level of unsafe. You certainly want to ensure you don't hand out `&mut [T]` references that alias each other or any `&[T]` references according to either virtual or physical addresses, but that seems totally possible. I would represent the ring buffer with a raw pointer and length. Then for callers I'd construct `&[T]` and `&mut [T]` regions as needed that are never more than the full (unmirrored) length and thus never include the same byte twice. There are several existing Rust crates for the mirrored buffer that (though I haven't looked into their implementations recently to verify) presumably do this: slice-deque, vmcircbuf, magic-ring-buffer, vmap.
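
To make the "never more than the full (unmirrored) length" rule concrete, here is a minimal sketch of just the index arithmetic. `contiguous_range` is a hypothetical helper, not taken from any of the crates above. It assumes a mapping of `2 * cap` bytes in which offsets `i` and `i + cap` are the same physical byte. It does not set up any mapping and makes no claim about soundness under the optimizer.

```rust
use std::ops::Range;

/// Byte range inside a hypothetical mirrored mapping of `2 * cap` bytes,
/// where offset `i` and offset `i + cap` refer to the same physical byte.
/// Returns `None` if the request is longer than the unmirrored length `cap`,
/// because such a view would contain some physical byte twice.
pub fn contiguous_range(head: usize, len: usize, cap: usize) -> Option<Range<usize>> {
    if cap == 0 || len > cap {
        return None;
    }
    // The start always lands in the first copy of the buffer.
    let start = head % cap;
    // The end may run into the second copy, but never past `2 * cap`.
    Some(start..start + len)
}
```

A view that wraps around the end of the ring (for example `head = 6`, `len = 4`, `cap = 8`) comes back as the single range `6..10`, which is the whole point of the mirror.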

I do think though there are some downsides to this approach that may or may not be deal-breakers:

* Platform dependence. Each of the crates I mention has a fair bit of platform-specific `unsafe` code that only supports userspace on a few fixed OSs. They fundamentally can't work on microcontrollers with no MMU; I don't think WASM has this kind of flexibility either.

* Either setting up each buffer is a bit expensive (several system calls + faulting each page) or you have to do some free-listing on your own to mitigate. You can't just rely on the standard memory allocator to do it for you. Coincidentally, just last week I was saying free-listing is super easy for video frames, where you have a nice bound on the number of things in the list and a fixed size, but if you're free-listing these at the library level or something you might need to be more general.

* Buffer size constraints. Needs to be a multiple of the page size; some applications might want smaller buffers.

* Relatedly, extra TLB pressure, which matters for performance in many applications. That's not just because you have the same region mapped twice: the buffer size constraints mentioned above also make it likely you won't use huge pages, so on e.g. x86-64 you might use 4 KiB pages rather than 2 MiB (additional factor of 512x) or 1 GiB (additional factor of 262144x), as the memory allocator would help you do if the buffers could be stuffed into the same huge page as other allocations.
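
The page-size arithmetic in the last two bullets can be sketched as follows. Both function names are made up for illustration, and the page size is assumed to be a power of two (true for the 4 KiB, 2 MiB and 1 GiB sizes mentioned above).

```rust
/// Round `len` up to a multiple of `page` (assumed to be a power of two).
pub fn round_up_to_page(len: usize, page: usize) -> usize {
    debug_assert!(page.is_power_of_two());
    (len + page - 1) & !(page - 1)
}

/// Number of pages (and so potential TLB entries) needed to cover a buffer
/// of `len` bytes that is mapped twice, back to back.
pub fn mirrored_pages(len: usize, page: usize) -> usize {
    2 * (round_up_to_page(len, page) / page)
}
```

For a 1 MiB buffer with 4 KiB pages, `mirrored_pages` gives 512 entries, where a single 2 MiB huge page could have covered the unmirrored data.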

dzaima 458 days ago

Rust doesn't help here; you necessarily must do all stores in potentially-mirrored memory as volatile (and possibly loads too), else you can have arbitrary spooky-action-at-a-distance issues, as, regardless of &[T] vs &mut [T] or whatever language-level aliasing features, if the compiler can see that two addresses are different (which they "definitely" are if the compiler, for one reason or another, knows that they're exactly 4096 bytes apart) it can arbitrarily reorder them, messing your ring buffer up. (and yes it can do so moving ops out of language-level lifetimes as the compiler knows that that should be safe)

vmcircbuf just exposes the mutable mirrored reference, resulting in [1] in release builds. Obvious issue, but, as my example never uses multiple references with overlapping lifetimes of any form, the issue would not be fixed by any more careful way of exposing references; it's simply the general problem of referring to the same data in multiple ways.

vmap afaict only exposes push-back and pop-front for mutation, so unfortunately I think the distance to cross to achieve spooky action in practice is too far (need to do a whole lap around the buffer to write to the same byte twice; and critical methods aren't inlined so nothing to get the optimizer to mess with), but it still should technically be UB.

slice_deque has many open issues about unsoundness. magic-ring-buffer doesn't build on modern Rust.

[1]: https://dzaima.github.io/paste/#0TVDBTsQgFLz3K56XbptsWlo1MWz...

scottlamb 458 days ago

> Rust doesn't help here; you necessarily must do all stores in potentially-mirrored memory as volatile (and possibly loads too), else you can have arbitrary spooky-action-at-a-distance issues, as, regardless of &[T] vs &mut [T] or whatever language-level aliasing features, if the compiler can see that two addresses are different (which they "definitely" are if the compiler, for one reason or another, knows that they're exactly 4096 bytes apart) it can arbitrarily reorder them, messing your ring buffer up.

Hmm, as I think about it, I see your point: if it inlines enough, LLVM's optimizer can "know" that memory hasn't changed when it really has, even if that memory is never put into the same &mut [T] as the other side of the mirror (and two improperly aliased &mut [T] are never constructed).

But as an alternative to doing all the stores in a special way (and loads...don't see how doing a volatile store to one side of the mirror is even sufficient to tell it the other side of the mirror has changed)...it'd be far more practical if the caller could use a (not mirrored) &mut [T]. Couldn't you have an std::ops::IndexMut wrapper that returns a guard that has a DerefMut into &mut [T] and on Drop creates a barrier for these kinds of optimizations via `std::arch::asm!("")`? [1] Then LLVM has to assume all memory changed in that barrier.
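
A rough sketch of that guard idea, under several assumptions. `Ring`, `Guard`, `view_mut`, `get` and `optimization_barrier` are hypothetical names. The storage here is an ordinary `Vec<u8>` standing in for the mirrored mapping, which is platform-specific and left out. `IndexMut::index_mut` has to return `&mut Self::Output`, so the guard is handed out by an ordinary method instead. Whether an empty asm block is enough under Rust's aliasing rules is an open question, so treat this as the shape of the API and not as a soundness proof. It only builds on targets where `asm!` is supported.

```rust
use std::arch::asm;
use std::ops::{Deref, DerefMut};

/// Stand-in for a mirrored ring. A real implementation would hold a raw
/// pointer to a region mapped twice; this sketch uses a plain Vec.
pub struct Ring {
    buf: Vec<u8>,
}

/// Mutable view that emits a compiler-level barrier when dropped.
pub struct Guard<'a> {
    slice: &'a mut [u8],
}

/// Empty asm block. Without `options(nomem)`, the compiler has to assume
/// the block may read or write memory. No CPU instruction is emitted.
#[inline(always)]
fn optimization_barrier() {
    unsafe { asm!("", options(nostack, preserves_flags)) };
}

impl Ring {
    pub fn new(cap: usize) -> Self {
        Ring { buf: vec![0; cap] }
    }

    /// Hand out `len` bytes starting at `start`; panics if out of range.
    pub fn view_mut(&mut self, start: usize, len: usize) -> Guard<'_> {
        optimization_barrier();
        Guard { slice: &mut self.buf[start..start + len] }
    }

    pub fn get(&self, i: usize) -> u8 {
        self.buf[i]
    }
}

impl Deref for Guard<'_> {
    type Target = [u8];
    fn deref(&self) -> &[u8] {
        &*self.slice
    }
}

impl DerefMut for Guard<'_> {
    fn deref_mut(&mut self) -> &mut [u8] {
        &mut *self.slice
    }
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        // Barrier after the caller's writes, before anything else touches the ring.
        optimization_barrier();
    }
}
```

The caller just sees a `&mut [u8]`-like value, e.g. `ring.view_mut(2, 3).copy_from_slice(&[1, 2, 3])`, and the barrier runs when the temporary guard is dropped.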

Regarding the more specific crate issues: I found these crates a while ago and hadn't looked extensively in their implementation. Thanks for pointing these out; I will have to look more closely if/when I ever decide to actually use this approach. I was leaning toward no anyway because of the other factors I mentioned. As an alternative, I was thinking of having a ring buffer + a little extra bit at the end that is explicitly copied from the start as needed. The maximum length of one message I need a contiguous view of is far less than the total buffer size, so only a fraction of the buffer would need to be copied.
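
A minimal sketch of that copy-based alternative, with no memory mapping tricks at all. `TailRing` and its methods are hypothetical names. It assumes byte elements and a known maximum message length `max_msg` that is at most the ring capacity.

```rust
/// Ring of `cap` bytes followed by `extra` bytes of tail. The tail is
/// refreshed from the start of the ring only when a view needs it.
pub struct TailRing {
    buf: Vec<u8>, // cap + extra bytes
    cap: usize,
    extra: usize,
}

impl TailRing {
    pub fn new(cap: usize, max_msg: usize) -> Self {
        assert!(cap > 0 && max_msg <= cap);
        TailRing { buf: vec![0; cap + max_msg], cap, extra: max_msg }
    }

    /// Write `data` at logical position `pos`, wrapping at `cap`.
    pub fn write(&mut self, pos: usize, data: &[u8]) {
        assert!(data.len() <= self.cap);
        let start = pos % self.cap;
        let first = data.len().min(self.cap - start);
        self.buf[start..start + first].copy_from_slice(&data[..first]);
        // Whatever did not fit before the end wraps to the start.
        self.buf[..data.len() - first].copy_from_slice(&data[first..]);
    }

    /// Contiguous view of `len <= max_msg` bytes at logical position `pos`.
    pub fn view(&mut self, pos: usize, len: usize) -> &[u8] {
        assert!(len <= self.extra);
        let start = pos % self.cap;
        let end = start + len;
        if end > self.cap {
            // Copy only the bytes of the start that this view spills into.
            let spill = end - self.cap;
            self.buf.copy_within(..spill, self.cap);
        }
        &self.buf[start..end]
    }
}
```

Only the spilled part of a wrapping message is copied, so the cost is bounded by the message length and not by the buffer size. `view` takes `&mut self` because it may refresh the tail.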

> vmcircbuf just exposes the mutable mirrored reference, resulting in [1] in release builds.

Yuck, noted, clearly wrong to give the whole thing as a `&mut [T]`.

> slice_deque has many open issues about unsoundness.

I see at least a couple of those, which seem to be "just" the usual unsafe-done-wrong sorts of things (double frees) rather than anything inherent to the mirrored buffer.

[1] https://stackoverflow.com/questions/72823056/how-to-build-a-...

dzaima 458 days ago

Yeah, an asm marked as memory-clobbering is the proper thing; not the first time I've forgotten that volatile doesn't imply anything at all about other memory. (In fact, doing "((volatile uint8_t*)x)[0] = 0xaa;" in my godbolt link in a sibling thread still has the optimization happen.) Don't know how exactly it interacts with aliasing rules; maybe you'd have to explicitly pass the mutable reference to the asm as an input, otherwise it'd be illegal for the asm to change it and so the compiler can still assume it isn't changed? Or, I guess, not having any references live during the asm call is the proper thing.

Probably indeed possible to do it with proper guards (the pre-pooping your pants issue is probably not a problem if you also have the asm guard in drop?).

> I see at least couple of those, which seem to be "just" the usual unsafe-done-wrong sorts of things (double frees) rather than anything inherent to the mirrored buffer.

Yeah, possible. I was just saying that from the perspective of proving that all the ring buffers not taking extreme care are incorrectly implemented.

scottlamb 458 days ago

> Don't know how exactly it interacts with aliasing rules; maybe you'd have to explicitly pass the mutable reference to the asm as an input, otherwise it'd be illegal for the asm to change it and so the compiler can still assume it isn't? or, I guess, not have any references live during the asm call is the proper thing.

I don't know either, but really it's the opposite half of the buffer you want to tell it may have changed, so I imagine it doesn't matter even if you still have the `&mut [T]` live.

Maybe the extra guard I described isn't necessary either; the DerefMut could directly return `&mut [T]` but set a `barrier_before_next_access` on the ring, or you could just always have the barrier, whatever performs best I guess.

leiroigh 458 days ago

>so unfortunately

I see a fellow enjoyer of bugs ;)

>vmap afaict only exposes push-back and pop-front for mutation

what about https://doc.rust-lang.org/nightly/std/io/trait.Write.html#ty... ?

>and critical methods aren't inlined

aren't inlined explicitly. This does not mean that they are not inlined in practice (depending on build options). Also, LLVM can look inside a noinline available method body for alias analysis :(

This is a big pain whenever one wants to do formally-UB shenanigans. I'm not a rustacean, but in Julia a @noinline directive will simply tell LLVM not to inline, but won't hide the method body from LLVM's alias analysis. For that, one needs to do something similar to dynamic linking, with the implied performance impact (the equivalent of non-LTO static linking doesn't exist in Julia).

dzaima 458 days ago

> I see a fellow enjoyer of bugs ;)

Yep! :)

I did look at the assembly on a release build and the write method was in fact not inlined (needed to get the compiler to reason about the offset aliasing); that write method is what I called "push-back" there. I could've modified the crate to force-inline, but that's, like, effort, just to make a trivially-true assertion for one HN post.

Indeed a lack of an equivalent of gcc's __attribute__((noipa)) is rather annoying with clang (there are like at least 4 issues and 2 lengthy discussions around llvm, plus one person a week ago having asked about it in the llvm discord, but so far nothing has happened); another obvious problem being trying to do benchmarking.

(for reference, what I was trying to get to happen was an equivalent of https://godbolt.org/z/jobs6M95G)

dzaima 458 days ago

Volatile stores would fix that issue. But it does mean that it'd be unsafe to lend out mutable references to objects. (maybe you'd need to do volatile loads too, depending on model of volatility)
