Hacker News new | ask | show | jobs
Bugs You'll Probably Only Have in Rust (gankro.github.io)
367 points by Gankro 3298 days ago
9 comments

One of the most important tools when writing unsafe rust is compiletest [1]. It's a tool extracted from the compiler project that lets you write tests that are supposed to fail compilation. Since safe abstractions rely on the type system to make unsafe code safe, it's critical to make sure the compiler is properly rejecting code. I wrote a post about this years ago when I got hit by one of the bugs Gankro wrote about [2].

[1]: https://github.com/laumann/compiletest-rs

[2]: http://erickt.github.io/blog/2015/09/22/if-you-use-unsafe/

> Making unsafe a big scary "all bets are off" button is only compelling if most of our users don't need to use that button. Rust is trying to be a language for writing concurrent applications, so sharing your type between threads requiring unsafe would be really bad.

It would be neat if we could decompose unsafe like so "unsafe[this_feature,that_feature] {}". The unqualified "unsafe" could still refer to a global "free reign", but you could opt-in to "only let me violate these specific rules." It would be a hint to maintainers and might help make the std lib and other core libraries be/remain defect-free.

Another interesting "oh shoot" w/unsafe that I'm curious about: when I intentionally/unintentionally alias two variables in my unsafe block, this will invalidate assumptions made elsewhere in safe code. This is my unsafe block's bug, but it seems like something that could take a good while debugging to attribute back to my unsafe block. I don't think there's a good resolution to this one other than perhaps documentation/best practices.

Re: parameterized unsafe -- I think it's been discussed and rejected, I don't remember where. I think it was mostly a matter of "yes this would be more powerful, but the complexity isn't worth it".

Note that we sort of made a "new" kind of unsafe with the UnwindSafe trait: https://doc.rust-lang.org/std/panic/trait.UnwindSafe.html

That's probably how we intend to solve these kinds of problem in the future.

Re: aliasing -- if it's a serious enough problem, one of two things will happen:

* Someone will develop a version of asan/ubsan for Rust.

* The Rust devs will be forced to reduce the extent to which they apply alias analysis by default (possibly with a flag to opt into it). At least temporarily.

The rust devs have backed off optimizations in the past when they break stuff in the ecosystem (struct layout optimization). But they also work with the affected devs to fix those bugs so they can turn the optimization on.

> Someone will develop a version of asan/ubsan for Rust.

This already happened (japaric [1]). But ASan won't save you from a bug due to optimization-because-I-assumed-these-locations-dont-alias (maybe TSan might?).

[1] https://users.rust-lang.org/t/howto-sanitize-your-rust-code/...

As you say, none of the existing sanitisers catch Rust-specific problems, which is, I assume, what the parent meant by "for Rust". That said, they will likely catch many of the consequences of such violations, just not pinpoint the cause as precisely.
A Rust-specific sanitizer has been proposed, though. See my other reply (which I was in the middle of writing when this thread popped up, so I didn't see it):

https://news.ycombinator.com/item?id=14553679

That isn't yet an existing sanitizer. :)

It is definitely an important missing piece to bridge the gap to ASan/TSan, but it's still just a proposal/work in progress, not least because, AIUI, the precise rules it needs to enforce aren't yet entirely clear.

> This is my unsafe block's bug, but it seems like something that could take a good while debugging to attribute back to my unsafe block.

You are correct: it's possible to write nefarious code inside an 'unsafe' block then only suffer its effects outside of it, and Rust has to document that fact. The Nomicon, mentioned in the blog post a whole bunch, points this out early on:

> 'unsafe' does more than pollute a whole function: it pollutes a whole module. Generally, the only bullet-proof way to limit the scope of unsafe code is at the module boundary with privacy.

[ https://doc.rust-lang.org/nightly/nomicon/working-with-unsaf... ]

> It would be neat if we could decompose unsafe like so "unsafe[this_feature,that_feature] {}"

I sometimes feel the same way, but remember that the `unsafe` keyword only unlocks four additional features:

1. Dereferencing a raw pointer

2. Calling an unsafe function or method

3. Accessing or modifying a mutable static variable (and this might conceivably even be removed entirely someday)

4. Implementing an unsafe trait

It's unclear to me how to make this any more fine-grained such that annotating the "kind" of unsafe you're using would be useful and enforceable by the compiler (which is crucial, because otherwise why not just use a comment?).

In practice I think really the only "distinction" in unsafe Rust that I want is the ability to distinguish unsafe blocks that exist only to call external C code.

> 3. Accessing or modifying a mutable static variable (and this might conceivably even be removed entirely someday)

Mutable static variables removed or the unsafety of accessing them? Didn't Rust, at one early point, not allow mutable global variables?

`static mut` specifically. To elaborate, we obviously can't just go and remove it now due to our backwards-compatibility promise (at least not without a long deprecation period and a breaking major language version bump). Furthermore, even if we wanted to we actually can't completely replace `static mut` just yet: the intended replacement (using normal (non-`mut`) `static` variables that have `UnsafeCell`s in them) isn't completely usable until our constant-evaluation story is fleshed out further. And the unsafety would still be present one way or the other, but this would allow us to make the language simpler and make it a bit easier to explain the `unsafe` keyword (it would be a general extension of our policy to push complexity out of the language and into libraries whenever possible, which we believe makes the implementation easier to audit and results in a safer and more reliable language).
One of the compiler team members advocated for removing static mut entirely, but it didn't quite happen before 1.0. It's totally feasible to do so if const fn was stabilized, but it's not, so...
:(
> Another interesting "oh shoot" w/unsafe that I'm curious about: when I intentionally/unintentionally alias two variables in my unsafe block, this will invalidate assumptions made elsewhere in safe code. This is my unsafe block's bug, but it seems like something that could take a good while debugging to attribute back to my unsafe block. I don't think there's a good resolution to this one other than perhaps documentation/best practices.

As part of the unsafe code guidelines effort, there's been talk of adding an 'unsafe checker' mode to rustc, analogous to valgrind or Clang's AddressSanitizer, which would alter code generation to add checks that could catch many classes of incorrect behavior at runtime. (This would have a high performance cost and would be intended as a debugging tool.) One of the things it would probably do is keep a global map of all live references, and complain if references are created that break the rules, e.g. a mutable reference is created to something that already has a (mutable or immutable) reference somewhere else in the program. Thus it could catch the kind of bug you mentioned.

Of course, you would have to remember to run the checker, and as a dynamic rather than static analysis it would only catch errors that are actually exhibited at runtime (so it probably wouldn't catch the MutexGuard example from the original blog post, unless there was some real code that raced on a MutexGuard). Still, in practice it should help a lot with ensuring that unsafe code doesn't break the rules.

Edit: Niko talked about this in a blog post in February. He proposes a somewhat more complex tracking system than the global list of references I mentioned:

http://smallcultfollowing.com/babysteps/blog/2017/02/01/unsa...

At Standard Chartered our in-house Haskell dialect did a tiny bit of that: it distinguishes between ReadIO and general IO. ReadIO is meant to be idempotent operations only.
unsafe with features sounds like monad transformers in haskell.

if implemented it would probably outperform haskell (given that monad transformers in haskell have runtime overhead, unfortunately).

So happy that Gankro is back writing things about Rust, and especially delighted to hear that the Rustonomicon is going to be fleshed out more. :)
If you—like me—were interested in Diesel ORM's zero sized types thing, here's a pretty decent explanation: https://np.reddit.com/r/rust/comments/3ur9co/announcing_dies...

edit: Go also has zero-sized types (struct{}), so I wonder if this is also possible? Probably not, I don't think, since the compiler doesn't see through interfaces.

If you're interested in more of this, Sean, the author of Diesel, is giving a talk @ RustConf in August (http://rustconf.com/program.html#sean) which will probably cover these tricks in more depth :)

[Talks will be recorded, but also tickets are on sale right now if you want to be there in person!]

> Go also has zero-sized types (struct{}), so I wonder if this is also possible?

No. It specifically uses Rust's generics system, and the fact that generics are monomorphized at compile time, whereas Go interfaces are not.

C++ templates can be used in similar ways.

Yes, but differently; C/C++ explicitly prohibit zero-sized types due to object identity.
To clarify: C says it's UB, C++ rounds up to 1.
I have to say, these RCA's of the various bugs are great for getting a better understanding of the internals of the language.

In a lot of ways it makes me trust Rust even more, because there is a deeper understanding of exactly how these guarantees are made.

Question for the rust folks - are there any features that wouldn't have been possible without "unsafe"? That is, if rust never had unsafe, would it have been fundamentally limited in any way? Or is it required for e.g. interoperability with C?
C and other FFI is fundamentally outside the control of the Rust compiler (or any other non-C compiler) and, furthermore, the foreign functions can have arbitrarily complicated preconditions (or plain bugs) that lead to memory safety violations. This means that, whether it is marked or not, these operations are semantically `unsafe`, as in, they risk memory unsafety because the compiler can't guarantee it. This does mean that all languages with FFI have safety holes.

Additionally, not having the facility for low-level/unchecked code just means that things like optimised data structures/memory management/hardware interaction get implemented either in the compiler or in other languages. The former is much harder to reason about and to modify: one is essentially writing code that generates compiler IR, which is more annoying and error prone that both just writing the code directly and just writing the IR directly (one way to think about this is the compiler is one big `unsafe` block). The latter is unfortunate because it results in impedence mismatches when doing the FFI calls both semantically and with performance, and it also means that code doesn't get to benefit from the usual Rust safe checks and high level features (like ADTs) that are all still available inside `unsafe` blocks.

I'll give you the shortest example: in order to build an operating system in Rust for x86, you need to do this:

  let p = 0xb8000 as *mut u8;
VGA drivers use the memory mapped at 0xb8000 to drive the device. This creates a pointer, p, at that address.

In order to demonstrate this is safe (okay so unsafe isn't in this example, creating p is safe, but writing to/reading from it is not), a language would have to know:

1. That your code is running in kernel mode, that is the entire concept of ring 0 vs ring 3.

2. That the VGA spec specifies that location in memory.

Yeah, in _theory_, you could have a language that does this, but that'd tie your language so, so, so deeply to each platform, that it's not feasible.

This can be extrapolated to all kinds of other low-level things.

> That your code is running in kernel mode, that is the entire concept of ring 0 vs ring 3.

That need not be the case though. You could have a kernel side allocator that sets up the MMU to map that memory to a pointer that you return which lives in the space of the process. The MMU would take care of the required arithmetic to access the memory at its actual location using an offset.

That way you can map resources from real addresses into arbitrary addresses on the user side.

I think the correct term for this mechanism is 'system address translation'.

The language would still have to understand all of that in order to write that kernel side allocator in safe code.
I don't see how that follows. The language can't possibly understand the intricacies of what the MMU is capable of (besides, every MMU is different), and as far as the language is concerned what is returned is simply a valid offset and a length to go with it to indicate where the allocated segment ends.
I think you're strongly agreeing with me. It's not feasible to have in the language.
Can't the prohibitions be modularized?

Like, when you compile for x86 there are a bunch of rules that aren't generally safe, but on that platform they are.

Modularization wouldn't help the fact that you'd still need a module per platform and that's not feasible, see the other replies to my comment. There's just far, far, far too many details.
Do you mean:

* All unsafe operations don't exist.

* All unsafe operations exist, but the literal unsafe keyword and its machinery doesn't exist

The latter is how most ostensibly safe languages work. See Haskell's UnsafePerformIO, Swift's UnsafePointer, and Java's JNI for 3 examples off the top of my head.

The former is just a really gimped language that would have been a pain in the neck to implement libraries for (see other replies for examples).

It also depends on how "deep" you want to get.

A lot of built-in constructs uses unsafe under the hood. Vector (Rust's dynamic arrays) does memory allocation / resizing under the hood for example, and there's no safe way to do it, unless some safer array allocation primitive is exposed.

Also things like mem::swap(x, y) cannot be implemented at all with safe rust. in order to perform swap, you need a temporary variable. That temporary variable would be uninitialized, which Rust does not allow.

Note that in c++ it invokes copy constructor - http://www.cplusplus.com/reference/algorithm/swap/ - but Rust's mem::swap works for types that does not implement the Copy trait (Rust's equivalent to copy constructor).

Same can be said about slice splitting functions, which is primarily used to work around Rust's borrow checker.

You pretty much nailed it with you question. FFI is by nature unsafe since C doesn't have lifetime semantics and traits around thread safety.
"The bug was a missing annotation, and the result was that users of Rust's stdlib could compile some incorrect programs that violated memory safety."

IIUC, technically, the bug was a missing implementation of a trait and the result was a data race (which I (weirdly, maybe) don't think of as memory safety).

In other words, TL;DR: magic is neat, except that sometimes it really sucks.

I may have misunderstood Ralf's bug. Is it really the case that MutexGuard<T> was seen as Sync if T was Send, rather that Sync? Wouldn't that be a bigger problem than just the case of MutexGuard?

Data races can lead to other more conventional displays of memory unsafety. For instance, a read of a pointer length pair (a &[T] slice in Rust) while a write is occuring could resulting in a torn read where the pointer is the pre-write value but the length is the post-write one. If the original length was short than the new one, this can lead to a buffer overflow.

One framing of the MutexGuard problem is that the type wasn't declared in a way that reflected its semantics best, although it is clearly unfortunate that doing this is more complicated than the incorrect way.

> I may have misunderstood Ralf's bug. Is it really the case that MutexGuard<T> was seen as Sync if T was Send, rather that Sync? Wouldn't that be a bigger problem than just the case of MutexGuard?

So T: Sync if &T: Send. MutexGuard internally contains a &Mutex<T> (and Poison, but that's irrelevant here). T was Cell<i32>. If you follow the rabbit hole, you'll net out that T was Send, and therefore MutexGuard was Sync.

My confusion (and I suspect others) is about what it means for &T to be Sync. Cell<T> isn't safe to be shared across threads (so isn't Sync) but it is Send if T:Send. But that means &Cell<T> is Sync? You can share a reference to something across threads but not the thing itself? What does that even mean?

You could imagine an alternate world where MutexGuard is Send, to allow transfer of ownership of a lock to a different thread while keeping the mutex locked. But that would mean &MutexGuard is Sync, WTF?

The syntax was a bit confusing: T: Sync means (&T): Send and also (&T): Sync. T being Send or not doesn't affect the threadsafety of &T (Send is about transferring ownership which cannot happen with a &T).

You are correct to be confused about (&Cell<i32>) possibly being Sync, because the assumptions that were implied were wrong: when Sync talks about sharing a T, that can be entirely thought of as transferring a &T to another thread (aka Sending the &T). In this sense, sharing a &T between threads (as in, (&T): Sync) is the same as transferring a &&T to another thread, but the inner &T can be copied out so the original &T was also transferred (not just shared) between threads; that is to say, (&T): Sync is 100% equivalent to T: Sync.

Anyway, back to the example here, Cell<i32> is not Sync, so neither is &Cell<i32>, but M = Mutex<Cell<i32>> is Sync (this is a major reason Mutex exists in that form: allowing threadsafe shared mutation/manipulation of types that do not automatically support it), and thus &M is Sync too. Since MutexGuard<Cell<i32>> contains &M, it was thus automatically, incorrectly Sync.

For your second confusion, it is okay for &MutexGuard to be Sync, if MutexGuard itself is. The problem here was MutexGuard was Sync incorrectly in some cases. (MutexGuard is semantically just a fancy wrapper around a &mut T, and so should behave the same as that for traits like Send and Sync.)

Ah! I wasn't following the references close enough.
Your "IIUC" is just a restatement of the sentence you quoted.

You understood the bug correctly, but its not a bigger problem. You probably are lacking context on auto traits, but this blog post contains the context you need if you read it again.

Wait just a minute. Ralf Jung writes,

"This means that the compiler considers a type like MutexGuard<T> to be Sync if all its fields are Sync."

Is that true in general? Is a type thread safe if all its fields are thread safe individually?

Send and Sync are about data races, which lead to memory unsafety, not other forms of thread safety (like dead lock freedom, or maintaining non-unsafe relationships between fields). If there's no unsafe code, then there's no way to have a data race when the individual components are also data race free.
I think it's fair to question a type being auto-derived to be Sync if only individual fields are Sync. It may lead to improper sharing of a reference to this value where threadsafety across fields is needed. That would be a bug, yes, but compiler never made the author pause to consider putting Sync there explicitly (and presumably thinking this through). So that "linting" aspect that's applied to, e.g., *mut T is not present.
I'm not entirely sure what you mean.

The compiler does make the author pause: dangerous types like `*mut T and `UnsafeCell<T>` are not Sync, so types containing them are also not automatically Sync.

In any case, Rust only guarantees no memory unsafety (requiring no data races). It tries to help with other things, like cleaning up resources via destructors, but these are "best-effort" rather than guarantees. The only way to get a data race and/or memory unsafety with an automatically implemented Sync is with `unsafe` code, and any `unsafe` code near a data type "infects" it and so means the whole type require great care.

Tooling like asan and tsan and, hopefully, Rust-specific sanitizers and static analyzers will make this easier to get right, but fundamentally as soon as `unsafe` comes into the equation the programmer has to be paranoid. Of course, as the MutexGuard problem indicates, humans getting it right error-prone, which is why the aforementioned tooling and formal proofs---like the one that found that problem---are important, as is building and using appropriate abstractions (e.g. MutexGuard is semantically designed to be a &mut T, so maybe it could indicate this by using PhantomData, or even just storing that directly: this does require manual work, but pushing conventions like that might bridge help the gap to having great `unsafe` tooling in future).

I suppose my point was that auto-deriving Sync may not have been such a great idea :). I understand the rationale for it but it does open up traps for people to fall into.
Somewhat tangential, but what ensures memory visibility in Rust? Say I allocate a struct (heap or stack), and then pass an immutable reference to a function that takes T: Sync. Assume the struct itself is Sync (e.g. bunch of integer fields). What ensures that the other thread sees all writes to this struct prior to the handoff?
It is the responsibility of cross-thread communication abstractions to use the right fencing (if it is touting itself as safe), probably with the various things in std::sync (especially ...::atomics) if it is pure Rust. For instance, spawning a thread, using a channel (std::sync::mpsc) or a mutex all do such things.

Just calling a function taking T: Sync doesn't need to do any of this, since that call happens all on a single thread. The function might do it internally if it needs to, but that is its own explicit implementation decision.

Ok, that's what I figured - thanks.

That does bring up the question, though, whether it's correct to say that a Sync type doesn't permit data races. In the example I gave above, publishing a Sync struct incorrectly can exhibit data race like symptoms on the receiving thread. So even though the type itself is Sync, that's not enough of a guarantee in the face of "unsafe" publication.

In safe Rust, sharing a value of any Sync type between threads can't result in data races. Send and Sync provide thread safety guarantees about types that other safe abstractions can rely upon, and fencing correctly is one of the things those abstractions have to do to be safe.

I guess "Sync types don't have data races" is the abbreviated version of "Sync types don't have data races in any safe code, no matter how weird and wonderful". That said, this qualification doesn't seem very interesting to me: something equivalent is required about pretty much any statement about any guarantee in any language with unsafe code or FFI (e.g. in Python, something along the lines of "pointers don't dangle in any code that doesn't use `ctypes`"), and thus is elided in a lot of discussions about programming languages.

If you consider `unsafe` Rust, then failing to fence correctly is just one way to get a data race.

It is a guarantee---or else it's a bug (which is the same as every other safe foundation). A type T gets to be `Sync` in one of two ways:

  1. It is "auto derived" when all of its constituent types are Sync.

  2. It is explicitly implemented using `unsafe impl Sync for T {}`. Note the use of the `unsafe` keyword.
And this is why I stick with ASM - I don't have to rely upon everyone else not screwing the pooch when it comes to them developing a language - I just talk straight to the computer, nothing gets lost in translation, my programs are 200x smaller and 400x faster than anything written in Rust.

2D Second Life clone, with full programming capability with built-in database - 2 megabytes. Solid ASM. Rust can't even come close, and never will.

> [..] my programs are 200x smaller and 400x faster than anything written in Rust.

And take 100x longer to develop :)

Nope. Once I start typing the code simply flies.