Hacker News
by kzrdude 3075 days ago
Rust is a language that offers you lots of compile time checks, and an escape hatch called unsafe that says “trust the programmer here.” Yes, it is possible—and easy—to make mistakes in the place where you have asked to be trusted, not checked.

We have a big pedagogical task ahead of us in teaching safe practices for unsafe Rust, and defensive coding practices in unsafe Rust.

We should also think about whether we can improve unsafe Rust to make it harder to misuse. There are improvements coming in compile-time evaluation, and those could make the compiler much better at detecting memory errors in unsafe code at compile time.

6 comments

There's not only a pedagogical task here, but the Rust community must learn how to write code safely. The major difficulty here is that in general, unsafe pieces of code cannot be safely composed, even if the unsafe pieces of code are individually safe. This allows you to bypass runtime safety checks without unsafe code just by composing "safe" modules that internally use unsafe code in their implementation.

This kind of problem comes up a lot. Composed atomic operations are not atomic. Composed correct threaded code is not always correct. Mixing Scheme control structures made with call/cc doesn't work as desired. Enabling different Haskell language extensions gets you off the deep end quickly, and some unsafe combinations are surprising (see GeneralizedNewtypeDeriving, which is considered unsafe even though it used to be safe).
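To make the "composed atomic operations are not atomic" point concrete in Rust terms, here is a minimal sketch. The function names (`racy_increment`, `atomic_increment`, `run`) are made up for illustration. Each individual `load` and `store` is atomic, but the pair is not, so concurrent increments can be lost; the single `fetch_add` cannot lose any.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Two atomic operations in a row: another thread can run between the
/// load and the store, so increments can be lost.
fn racy_increment(counter: &AtomicUsize) {
    let seen = counter.load(Ordering::SeqCst);
    counter.store(seen + 1, Ordering::SeqCst);
}

/// One atomic read-modify-write: no increment can be lost.
fn atomic_increment(counter: &AtomicUsize) {
    counter.fetch_add(1, Ordering::SeqCst);
}

/// Runs `step` `per_thread` times on each of `threads` threads and
/// returns the final counter value.
fn run(threads: usize, per_thread: usize, step: fn(&AtomicUsize)) -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    step(&counter);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    counter.load(Ordering::SeqCst)
}

fn main() {
    // Always exactly 4000.
    println!("fetch_add:  {}", run(4, 1000, atomic_increment));
    // At most 4000; may be lower when updates are lost.
    println!("load+store: {}", run(4, 1000, racy_increment));
}
```

Note that both versions are entirely safe Rust: losing an update is a logic bug, not a memory-safety violation.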

> The major difficulty here is that in general, unsafe pieces of code cannot be safely composed, even if the unsafe pieces of code are individually safe. This allows you to bypass runtime safety checks without unsafe code just by composing "safe" modules that internally use unsafe code in their implementation.

This comment suggests you don't have much domain knowledge about how `unsafe` in Rust works, so I'm surprised you speak with such confidence. Your comment is flatly wrong: users using only safe code are not responsible for guaranteeing the composed safety of the components they use (whether or not they are implemented with unsafe code).

Interfaces marked safe must uphold Rust's safety guarantees, or they are incorrect. They are just wrong if they have additional untyped invariants that need to be maintained to guarantee their safety; interfaces like this must be marked `unsafe`.

Because they cannot depend on untyped invariants, any correct implementation with a safe interface can be composed with any other. This ability to create safe abstractions over unsafe code which extend the reasoning ability of the type system is a fundamental value proposition of Rust.
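As a rough illustration of a safe abstraction over unsafe code, here is a small sketch. `NonEmpty` is a made-up type, not something from the standard library. The invariant ("at least one element") is enforced by the private field and the constructor, so no caller has to know about it, and the `unsafe` inside `first` cannot be triggered incorrectly from safe code.

```rust
/// A list that is never empty. The field is private, so safe code outside
/// this module cannot break the "at least one element" invariant.
pub struct NonEmpty<T> {
    items: Vec<T>,
}

impl<T> NonEmpty<T> {
    /// The only constructor; it takes the first element up front.
    pub fn new(first: T) -> Self {
        NonEmpty { items: vec![first] }
    }

    pub fn push(&mut self, item: T) {
        self.items.push(item);
    }

    /// Safe to call from anywhere.
    pub fn first(&self) -> &T {
        // SAFETY: `new` inserts one element and nothing ever removes one,
        // so index 0 is always in bounds.
        unsafe { self.items.get_unchecked(0) }
    }
}

fn main() {
    let mut list = NonEmpty::new(10);
    list.push(20);
    println!("{}", list.first());
}
```

If someone later added a `pop` method without updating `first`, the bug would be in this module, not in whatever safe code happened to call `pop` and then `first`.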

> This comment suggests you don't have much domain knowledge about how `unsafe` in Rust works, so I'm surprised you speak with such confidence.

I hate being tone police, but jeez, we're having a discussion about Rust here and talking about my personal competency is inappropriate and unwelcome.

The problem I'm talking about happens when you write libraries that contain "unsafe" blocks. You want to prove (or at least assure yourself) that no unsafe behavior is observable by clients of the library. However, the way to do this is not entirely clear, although there is research being done in this area. One known trap is that it is not sufficient to demonstrate that Rust code without "unsafe" blocks cannot observe unsafe behavior in your library.

See: https://plv.mpi-sws.org/rustbelt/popl18/paper.pdf

These concerns are not hypothetical: there have been soundness problems in the Rust standard library before, and I expect it to happen again.

Proving the correctness of unsafe code is totally different from what you talked about, which was composing different abstractions with unsafe internals together.

Users of safe Rust do not need to worry about whether the composition of two safe interfaces that use unsafe internally is safe unless one of those interfaces is incorrect. Your comment would suggest that users need to think about the untyped invariants of each library they use, but this is not correct: libraries are not allowed to rely on untyped invariants for the correctness of their safe APIs.

The problem with talking about this subject is that "safe" and "unsafe" are overloaded terms in Rust, so I can understand why you think I was talking about something different.

Let R be arbitrary Rust code with no "unsafe" blocks. Let X and Y be libraries with "unsafe" blocks. You can prove that R + X is safe, and prove that R + Y is safe, but you haven't yet proven R + X + Y is safe. This is the hard part, because without an understanding of what property of X and Y individually makes R + X + Y + Z + ... safe, we don't have a good definition for what makes an interface "safe".

And this is what I mean when I say that this is not only a pedagogical problem.

You have restated your position, but it is still incorrect in the context of this discussion. Even your original statement of "R be[ing] arbitrary Rust code with no 'unsafe' blocks" is problematic: any Rust code is, very unavoidably, built upon a foundation of unsafe code. It has to be, because it's running on an "unsafe" processor. And yet, any safe Rust code in the core library (barring a safeness bug) is obviously safely composable with any other safe Rust code precisely because it obeys safety guarantees when transitioning from unsafe to safe. The fact that you can mistakenly conceive of Rust code that somehow avoids any internal unsafety simply reinforces how obvious this simple fact is.

But using your original problem statement, if R is safe and X and Y use unsafe code but do not expose any unsafe interfaces, then either R + X + Y is safe or one of [X, Y] has a safety bug and is inaccurately marking an unsafe interface as safe.

This is a generally unsolvable problem, and every other language has this problem as well; the difference being that in most other languages you're typically forced to write the unsafe code in C (where one has much greater variety of footguns available at their disposal). If I write a Ruby FFI wrapper for buggy C code whose interfaces bleed "unsafe" (from the perspective of the Ruby VM) behavior, then I am liable to experience crashes and memory corruption bugs. The only difference here is that Rust allows you to break the seal on the warranty without switching to a different language.

> One known trap is that it is not sufficient to demonstrate that Rust code without "unsafe" blocks cannot observe unsafe behavior in your library.

I'm curious: what does this mean/could you point me to the part of the paper that describes it? (Unfortunately, I don't have time to read all 34 pages at the moment.)

p. 66:3, the two paragraphs starting with "However, there is cause for concern..."
Thanks!

I'm not convinced that the statement in the paper translates into what you said: the key piece of that paragraph is "or seems to be". The Leakpocalypse problem was that one piece of code (crossbeam's scoped threads API) was relying on an invariant that doesn't actually hold ("destructors will always run"). It was, fundamentally, a bug in the `unsafe` code in crossbeam, meaning it was incorrect for crossbeam to call its API safe. The fact that it took multiple libraries to trigger in that case means nothing; it just happens to be the circumstances under which the problem was noticed.

Of course, to be fair, no-one had thought about this destructor property before, just implicitly relied on it, and so it does demonstrate the necessity for better understanding of/tools for unsafe code, which is what projects like RustBelt are pushing towards.
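For anyone unfamiliar with why "destructors will always run" does not hold, here is a minimal sketch. `Guard` and `count_drops` are hypothetical names used only for this example. `std::mem::forget` is a safe function, so safe code can skip a destructor without any `unsafe` block.

```rust
use std::cell::Cell;
use std::mem;

/// Counts how many times its destructor runs.
struct Guard<'a> {
    drops: &'a Cell<u32>,
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        self.drops.set(self.drops.get() + 1);
    }
}

/// Drops one guard normally, leaks another, and returns how many
/// destructors actually ran.
fn count_drops() -> u32 {
    let drops = Cell::new(0);
    {
        let _dropped = Guard { drops: &drops }; // destructor runs at end of scope
    }
    let leaked = Guard { drops: &drops };
    mem::forget(leaked); // safe function; this destructor never runs
    drops.get()
}

fn main() {
    println!("destructors run: {}", count_drops());
}
```

Any `unsafe` code whose soundness depends on a guard's destructor running is therefore relying on something safe callers are free to violate.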

To summarise, I still don't see how these two sentences are different:

> no unsafe behavior is observable by clients of the library

> [clients] without "unsafe" blocks cannot observe unsafe behavior in [the] library

Indeed, I don't think it makes sense to even attempt to prove that clients with unsafe code can't observe unsafe behaviour (which seems to be the only way for the second sentence to differ from the first). The typical framing is that the safe code can be arbitrarily bad and there'll still be no undefined behaviour, but arbitrary `unsafe` can do anything, including writing directly to another library's data structures, which of course can easily cause UB (e.g. replace a Vec's data pointer with a null one).

Correct me if I’m wrong but I think GP was stating that two “unsafe” blocks composed together (each manually verified to work well) might interfere with each other when run simultaneously.
'unsafe' doesn't mean unsafe. Unsafe means "I can't convince the compiler that this is safe. But in my context, it is."

If there is any way in which a function containing an `unsafe` block may be used unsafely (specifically, violating memory-safety), then that function must also be marked as unsafe.
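A small sketch of that rule, with hypothetical function names: `get_trusted` can be misused (an out-of-range index is undefined behaviour), so it is an `unsafe fn`; `get_checked` contains an `unsafe` block but cannot be misused, so it stays safe.

```rust
/// Caller must guarantee `index < values.len()`. The compiler cannot check
/// that, so the function itself is marked `unsafe`.
unsafe fn get_trusted(values: &[i32], index: usize) -> i32 {
    // SAFETY: guaranteed by the caller, per the contract above.
    unsafe { *values.get_unchecked(index) }
}

/// Cannot be misused from safe code: the bounds check happens here.
fn get_checked(values: &[i32], index: usize) -> Option<i32> {
    if index < values.len() {
        // SAFETY: `index` was just checked against the length.
        Some(unsafe { get_trusted(values, index) })
    } else {
        None
    }
}

fn main() {
    let values = [1, 2, 3];
    println!("{:?} {:?}", get_checked(&values, 1), get_checked(&values, 3));
}
```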

That's what it means if you use it properly. If you write bad code, it means "this code will break everything and the compiler won't protect you." An `unsafe` block does nothing to guarantee that you're doing something safe, which is what you seem to be saying, even if it's not what you mean to say.
This is pretty much tautological, and nobody's arguing this point. However, you have the benefit of being able to narrow your search scope to the parts of your code marked `unsafe` instead of the entire project.

Most things don't need unsafe code. For the things that do, you must yourself uphold the invariant that all requirements of safety are being obeyed when transitioning out of an unsafe block. If you don't do this, bad things can happen. Other languages don't have this because they either don't offer Rust's safety guarantees in the first place, or the only way to circumvent them is to write code in C.

It's possible that I misread the comment; it seemed to state that this problem extended into safe Rust, which it definitely does not.
Think of two libraries that use unsafe Rust and interact with the same hardware, but work correctly when used on their own.

A program written only in pure not-unsafe Rust might use these two libraries in a way that breaks because the assumptions the library authors made, for example having exclusive access to the hardware, no longer hold.

One could argue the pure not-unsafe Rust program is wrong, not the libraries.

I think klodolph's comment is very thoughtful and shows a good deal of experience and domain knowledge.

There is a conflation happening here. What is the nature of this bug when you compose these two libraries together?

If it is a violation of Rust's safety guarantees, then at least one of those libraries has a bug: it exposes a safe abstraction which is not actually safe. One could not argue that the safe Rust program is wrong; the library exposing an unsafe interface as safe is unarguably wrong.

If the library just behaves incorrectly in a manner disconnected from the type system because some global state was changed in a way it doesn't expect ("the hardware" in this case), then that's a normal bug & it is not connected to unsafe code at all.
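One way a library could turn an "I have exclusive access" assumption into something the type system checks is to hand out a non-copyable handle at most once. This is only a sketch: `Board` and `Uart` are hypothetical types, and it assumes only one `Board` is ever created for the real device.

```rust
/// Hypothetical handle to a UART. It is neither `Clone` nor `Copy`, so
/// whoever owns it holds the only handle.
pub struct Uart {
    _private: (),
}

/// Hypothetical board that hands out each peripheral at most once.
pub struct Board {
    uart: Option<Uart>,
}

impl Board {
    pub fn new() -> Self {
        Board { uart: Some(Uart { _private: () }) }
    }

    /// Returns the UART the first time, `None` afterwards.
    pub fn take_uart(&mut self) -> Option<Uart> {
        self.uart.take()
    }
}

fn main() {
    let mut board = Board::new();
    let for_library_x = board.take_uart();
    let for_library_y = board.take_uart();
    // The second library finds out up front that the device is taken.
    println!("{} {}", for_library_x.is_some(), for_library_y.is_some());
}
```

With this shape, two libraries that both need the device cannot both be given it by a safe program, so the invariant is no longer untyped.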

> see GeneralizedNewtypeDeriving, which is considered unsafe even though it used to be safe

This is wrong. GND plus TypeFamilies or some other extension in that vein used to be unsound when combined. It has since been fixed via the introduction of type roles.

> Composed atomic operations are not atomic.

Incidentally, Haskell also has this figured out via the STM monad.

I hadn't heard about the advancement with Roles, but it seems that GND is still prohibited in "Safe Haskell"?
You’re probably reading an old document - this restriction was removed with the introduction of roles. https://downloads.haskell.org/~ghc/7.8.4/docs/html/users_gui...

Roles were introduced in 7.8.something, and GND was added to Safe.

Rust's approach to "unsafe" is to let the programmer do whatever they want. Having to use this for UNIX-type API calls is kind of lame.

I once proposed extending C to allow talking about array sizes.[1] You'd define "read" as

    int read(int fd, char &buf[len], size_t len);
The compiler now knows that "buf" is an array with length "len", and can check calls for "buf" being the right size. The generated code for the call is the same; this doesn't require array descriptors. It just says which parameter defines the length of the array.

All the original UNIX calls and most of the Linux ones fit into that simple model. If the size of something is hard to define simply at an API call, the API has a problem.

Rust's system for external C calls should be more like that and less about casts to raw pointers. It's technically possible to fix this in C, and have a "strict mode", but the political problems are too hard.

[1] http://www.animats.com/papers/languages/safearraysforc43.pdf

> Rust's system for external C calls should be more like that and less about casts to raw pointers.

It seems a rosy-eyed view to think that this would help safety significantly, and it would require a lot of effort: it's likely to be much lower pay-off than other things, like investing in, say, sanitizers or even just doing the work of writing safe wrappers for popular C libs, removing C FFI concerns from most people, who can just use the Rust library.

Specifically, as you say, C doesn't have this information, meaning there's no way for Rust's (or another language's) FFI to work like this automatically. Instead, someone will have to annotate the C code, have some extra "notes" layer, or annotate the imported Rust declarations. Either way, there's a human element, meaning a place for mistakes to be made. It seems like the less-duplicative way to do this is to make Rust wrappers that take Rust slices, since these will be wanted in the end anyway.

Of course you want to use Rust slices. Those map directly to the kind of C array I outlined. If you could declare a C API that way to Rust, you'd get the mapping without talking about pointers explicitly at all.

What I'm arguing for is a declarative way to talk about C interfaces that is consistent with Rust's model. This is better than using "unsafe" to construct C-type raw pointers. Yes, this is more restrictive and there will be some awful C APIs you can't describe. That's a good indication said C API is trouble.

What would make this "declarative way to talk about C interfaces" less error prone than something like this?

    use std::os::raw::{c_char, c_int};

    extern "C" {
        fn read(fd: c_int, buf: *mut c_char, len: usize) -> isize;
    }

    pub fn read_buf(fd: c_int, buf: &mut [c_char]) -> isize {
        unsafe { read(fd, buf.as_mut_ptr(), buf.len()) }
    }
Further, note that this is insufficient for an idiomatic Rust API. You would also want to wrap the file descriptor (perhaps not for all C APIs) and the return value (definitely applies to all C APIs). So it would really look more like this:

    pub struct File { fd: c_int }

    impl File {
        pub fn read(&self, buf: &mut [u8]) -> Result<usize, ReadError> {
            let r = unsafe { read(self.fd, buf.as_mut_ptr() as *mut c_char, buf.len()) };
            if r == -1 {
                Err(ReadError::from(errno))
            } else {
                Ok(r as usize)
            }
        }
    }
I can certainly imagine a way to do that declaratively, but not in a way that helps even this most basic of examples. (Also, note that constructing raw pointers is completely safe: `as_mut_ptr`, for example.)
That's not bad. It would be useful to be able to use some kind of "C slice" in an extern fn declaration, so you could talk about arrays, rather than pointers. Same function call code, but more Rust-like syntax. Then you don't need unsafe imperative code at all.

This would put all the memory-risky stuff in declarations of external functions.

> I once proposed extending C to allow talking about array sizes.[1]

That would be a very useful, and relatively unobtrusive, extension to C. I've always liked the idea of a C "strict mode". I wish the political problems weren't so hard.

That's for local variables. Microsoft and Linus Torvalds didn't like it, because it's a way to suddenly cause unexpected stack growth of arbitrary size. That feature was made optional in C11, and Microsoft never implemented it.
FWIW Microsoft does have SAL annotations to do the same thing. For example fread's prototype is

    size_t fread(
        _Out_writes_bytes_(_ElementSize*_Count) void * _DstBuf,
        _In_ size_t _ElementSize,
        _In_ size_t _Count,
        _Inout_ FILE * _File
    );
https://docs.microsoft.com/en-us/visualstudio/code-quality/a...
C++ compilers also have references to arrays which can be abused in some cases:

    template < size_t len > int read(int fd, char (&buf)[len]); // array size will be inferred
    int read(int fd, char (&buf)[1024]); // array size must be exactly 1024
C++ largely suffers from the same problems. Often a C++ programmer can write code which relies on iterators and containers which is quite safe and difficult to mess up, while for a variety of highly-specialized applications, mixtures of packed structs, pointer arithmetic, and arbitrary sequences of binary data need to be handled with utmost care.

Knowing when to use which set of tools and how to safely glue them together is important.

Now, I will say that the C++ community has been teaching safer, cleaner practices for years now and users seem to be largely adopting them. It works, as long as the developers don't pay a runtime or excessive development cost to do so.

[I'm sure a crustangelist is likely to come tell me that I can never write safe C++ code and that the universe will hate me for eternity for not leaping to rust, but please, understand that I don't suffer from unsafe memory issues on the whole because modern C++ is quite safe. You won't convert me, but I'm also not trying to convert you.]

Container and iterator code is not safe at all since there is no bounds checking by default and no protection against iterator invalidation, which can both cause writes to memory outside the intended object and thus a catastrophic outcome.
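For comparison, here is a rough sketch of how the same two hazards look in safe Rust (the function names are made up for the example). Pushing to a vector while iterating over it is rejected at compile time, and an out-of-range read through `get` returns `None` instead of touching memory outside the object.

```rust
/// Appends a doubled copy of every element to the vector.
fn append_doubled(values: &mut Vec<i32>) {
    // Rejected by the borrow checker: `push` needs `&mut` access while the
    // iterator still holds a shared borrow of the vector.
    //
    // for v in values.iter() {
    //     values.push(*v * 2);
    // }

    // Accepted: finish reading before starting to write.
    let doubled: Vec<i32> = values.iter().map(|v| v * 2).collect();
    values.extend(doubled);
}

/// Out-of-bounds reads are checked at run time.
fn checked_read(values: &[i32], index: usize) -> Option<i32> {
    values.get(index).copied()
}

fn main() {
    let mut values = vec![1, 2, 3];
    append_doubled(&mut values);
    println!("{:?} {:?}", values, checked_read(&values, 99));
}
```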

There is no safe subset of C/C++ unless you just don't use pointers or references at all (and refrain from using any library that is not safe which includes large parts of the standard library like all the containers), or you write it in Rust or an equivalent language with lifetimes and linear types and automatically translate it to C/C++ somehow.

> unless you just don't use pointers or references at all (and refrain from using any library that is not safe which includes large parts of the standard library like all the containers)

It may seem far fetched, but it might be more practical than you'd think. The SaferCPlusPlus[1] library provides memory-safe implementations of the most commonly used standard library containers, and pointer types that reflect the lifetimes of their target objects. That is to say, there is a practical subset of C++ that is more closely comparable to safe Rust than is conventional C++.

[1] shameless plug: https://github.com/duneroadrunner/SaferCPlusPlus

To be fair, I believe you and the individual you are replying to are treating the word 'safe' differently. Correct; C++ doesn't have a built-in concept of "safe" that is compiler guaranteed and anything written in that, if it isn't written defensively at literally every line of code, falls on the library consumer to handle that.

Rust, CLR/JVM/interpreted languages are 'safe' because the compiler will flat out refuse to do things that are unsafe (with the exception of Rust and some non-interpreted languages allowing you to declare portions of code as 'unsafe'/'hold my beer'). Short of bugs in compiler/standard library, or unsafe code from libraries written in 'unsafe' languages that are consumed by safe languages (which usually requires a bug in the library, not a bug with how the library is called in the "safe" context, but not always), C++ is 'not safe at all' by comparison. I think if you swap the word 'safe' with 'reliable', that is what the individual you were replying to was getting at. 'Safe' in this context is: "The compiler put the foot-shooting-gun in a safe", vs. 'reliable' is "the gun is in my hand, has no safety, and a somewhat light trigger but it's aimed at the target, not my foot ... as far as I know".

You can handle pointers and references safely as well as use components of the standard library that don't do bounds (or a lot of other, "perfectly reasonable but missing for performance/philosophical reasons") checks, but it's up to you.

A really terrible analogy: it's illegal to drive a car where I live with either of the front passengers lacking a safety belt. Heck, you can't even build a car without a number of safety features that regulation requires. It's also got a number of features to help you avoid accidents. If you or someone screws up on the road, you're protected by the safety features and your mastering of driving. That's the 'safe' programming languages that most people use these days. C++/C is like my motorcycle. The only safety features it comes with rely entirely on my skill at not only "not making mistakes" but anticipating the mistakes of others -- I've had several close calls but have been able to maneuver around other distracted drivers/library maintainers, but if I'm not paying attention to everyone/everything around me I'm toast. And even then, some accidents are unavoidable that would have been survivable with a steel cage and a safety-belt[0].

[0] But damn, that bike is fast, and unlike C/C++, it's a lot more fun to use than the safer alternatives.

Reliable is a better word. C++ has lots of features to help you avoid accidents. It's just that you know that certain operations take certain levels of precautions. I like C++. I like the expressiveness it provides. Yes, some things are unsafe, but the amount of time I've spent finding segfaults or other memory errors since becoming proficient is an epsilon in relation to the amount of time I've spent getting all of my crazy template magic to fit into the right spots.
I'll be an anti-crustangelist and say that I've actually avoided moving a C++ project I've been working on over to Rust because (1) I'm finding that fixing some of the previous code's pre-C++xx practices is suitable enough and (2) I've only written a few small things in Rust up to this point and learning the 30% or so more that I'd need to in order to get things fixed would take more time and has more unknowns. Granted, it's not a large application, the code is easy to follow in its current state, and it wasn't "already a mess" when I started working on it. Fixing its problems has required careful review of the code-base but it was certainly possible and practical to correct the issues with this (small) application; even by someone as weak in C++ as I am[0].

C++ has improved quite a bit, from my perspective, anyway.[1] That said, I'm excited about Rust and have started (shallowly) exploring it. I like what I see, so far; particularly with improvements on the ergonomics of the language. Seeing it put to use in major projects (cough Firefox) successfully and reading about the problems it solved for Mozilla is the main reason I've set a goal to become proficient in it this year. It's a tall order to commit to a new language, particularly when the other languages I write in generally do everything I need them to. There's a small number of things, though, that still pull me toward C++, and I'd rather have an alternative.

As pleasantly surprised as I was with C++, I had plenty of four-letter-word-riddled moments. Practically all of it stemmed from old libraries, or legacy pieces/parts with my favorite being "let's look at the documentation to see what kind of string this method expects/returns". Character encoding, character byte-sizes, differences between byte-length and semantic length are all complexities when dealing with strings -- many of which get hidden away by CLRs or JVMs or script interpreters. And I'm sure there are some reasons that a person with moderate C++ knowledge could tell me as to why so many of the recently developed (proprietary) libraries seemed to love to pass pointers to non-unicode character arrays around (performance? comfort? nationalist? satan worship?), but it was a punch in the face when I knew an "easy" std::string was right there and never needed to be a character array/serve as a buffer/do anything but be a unicode string for a brief moment of existence. And if I have to figure out why Hunter failed to download the boost library because someone statically linked it to cURL without https support, or used the built-in implementation and compiled it with the wrong flags, or for whatever reason, the downloaded version fails the SHA1 check Every. Single. Time. ... well, no need to conclude that one.

Heck, I'd argue crates is a C++-killing feature for me. Yes, Hunter can be made to work (kicking and screaming, sometimes) with cmake, which I'm told can also be made to work. Microsoft has one, too (I can't remember its name, and I know they were working on making it possible to just "use NuGet"[2]), but I've always felt that a lack of easy dependency retrieval and management caused three problems: (1) people use old libraries that are very likely to be present on the target build host, (2) people write their own (poor, naive) implementations for Solved Problems(tm), or (3) the miserable f*ck doesn't build, there's not enough documentation to figure out what in blue-blazes <qwertyuio.h> is, who wrote it and where it came from, and when you do finally find it, it won't build because it's missing its dependencies, so pick (1) or (2) or give up. Compare that against '(package-manager) install (package)' and hey, I'm writing code like I originally set out to!

Wow, this devolved pretty quickly into a rant. My apologies for that -- it really isn't as bad as I've made it sound and I realize that most/all of these are my problems and I'm not knocking a language (or folks who program in it) for not bending to my will and having every feature I want, but I'm hopeful for what's coming around with Rust, D and others that are tackling the systems programming space. This Zig article caused me to read several others, as well. The compile-time variables as a workaround for lack-of-macros[3] looks like an interesting idea -- I'm not sure if the syntax is clear enough (globals are implicitly compile-time) but since it's a somewhat unfamiliar syntax, I lack experience to speak intelligently on that.

[0] My adventure started with troubleshooting a very consistent memory leak that was generally caused by some code-in-a-loop that failed to delete things. Often the solution was to change code to use something from boost (which it took a hard dependency on, anyway) or wrapping it in a class and RAIIing my way to a better reality. (and can we get a new acronym? I always write RIAA and if I don't write it, I see it and hairs on my neck stand up)

[1] I "gave up" C++ development around 2001 and short of reading code on rare occasion, didn't seriously start working in the language again until a couple of years ago. I felt like I was writing in a different language -- not sure if that was perception having been away from it for so long, or if it really was that different -- it took a lot of reading to get to a point where I was comfortable breathing in the direction of the code I was playing with.

[2] And NuGet could be a good option, here, especially if they move away from its roots of being somewhat of "it's really just a powershell script with a kludgy metadata file" since I'd rather not add yet another shell to my non-Windows hosts that already have alternatives that I prefer. Last I looked -- .Net Standard pre-2.0, they were fixing the metadata problems -- and maybe they weren't all that bad to begin with considering I can't think of the last time NuGet got in my way on a .Net app.

[3] Though, ideally, I want both.

edit: fix some bad footnote pointers - sheesh, can't even write a comment without a segfault

Well, if you want to write a tree using indices in a local array instead of pointers for better locality and memory footprint (which is ideal for many situations), you run into the difference between language pointers and computer science pointers. That's not even something that Rust will be smart enough to help you do properly.
In fact (unless something has changed dramatically without my getting the memo), that's how you would write many data structures in Rust---either for better memory behavior or you have cycles and Rc won't cut it. You have to use a vector of nodes and indices as pointers; if you try using references, the borrow checker comes and kicks sand in your face.
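A minimal sketch of that vector-of-nodes approach, with made-up names (`Tree`, `Node`): links are plain `usize` indices into one `Vec`, so parent and child links can coexist without `Rc` or fighting the borrow checker. A stale or wrong index gives a panic or the wrong node, not a memory-safety violation.

```rust
/// A node refers to its relatives by position in `Tree::nodes`.
struct Node {
    value: i32,
    parent: Option<usize>,
    children: Vec<usize>,
}

struct Tree {
    nodes: Vec<Node>,
}

impl Tree {
    fn new() -> Self {
        Tree { nodes: Vec::new() }
    }

    /// Adds a node under `parent` (or as a root) and returns its index.
    fn add(&mut self, value: i32, parent: Option<usize>) -> usize {
        let index = self.nodes.len();
        self.nodes.push(Node { value, parent, children: Vec::new() });
        if let Some(p) = parent {
            self.nodes[p].children.push(index);
        }
        index
    }

    /// Sums a node and all its descendants.
    fn sum(&self, index: usize) -> i32 {
        let node = &self.nodes[index];
        node.value + node.children.iter().map(|&c| self.sum(c)).sum::<i32>()
    }

    /// Counts the parent links between a node and the root.
    fn depth(&self, index: usize) -> usize {
        let mut depth = 0;
        let mut current = index;
        while let Some(p) = self.nodes[current].parent {
            depth += 1;
            current = p;
        }
        depth
    }
}

fn main() {
    let mut tree = Tree::new();
    let root = tree.add(1, None);
    let a = tree.add(2, Some(root));
    let _b = tree.add(3, Some(root));
    let c = tree.add(4, Some(a));
    println!("sum = {}, depth of c = {}", tree.sum(root), tree.depth(c));
}
```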
It's doable to keep using pointers instead of indices- they just all have the lifetime of the vector and you can freely follow them around.

This does prevent resizing the vector, but you can get around that by using a different arena that allocates in chunks rather than reallocating (and thus doesn't require a unique reference for .insert).

Rust can totally help you do that properly. You can wrap the indices in a struct parameterized by a lifetime and regain all the same tools you would have with language pointers.
I would suspect that this particular issue is not necessarily a defensive coding problem.

https://gist.github.com/andrewrk/182ace5dee6c4025d8c4b0ca22c...

https://github.com/andrewrk/libsoundio/blob/fc96baf8130b52ba...

I've written that code before, and I know better (but then, all the world's an x86 box, right?) But first, I'm not sure how to make that code not broken (yes, that's an education issue), and second, the same arguments can be made about all the issues Rust is designed to prevent.

This really should be a compiler warning.

It will be interesting to watch the proportion of safe/unsafe code in large Rust codebases over time.
At a guess, it will increase until the necessary-but-not-currently-handled constructs are dealt with (arena-based memory management, I'm looking at you) and then decrease asymptotically. Already in Rust, if you adopt C++ STL idioms (and don't want to squeeze more performance out) and don't need to visit currently-unwrapped interfaces, you won't need unsafe at all.

Rust is a very good C++ replacement.

To put a point on this question: should an unsafe language (or language subset) be as safe as possible, or as unsafe (i.e. powerful) as possible?
Unsafe Rust is a superset, not a subset, incidentally.
It seems to me that "unsafe" Rust is a subset of Rust as a whole. Unless the Rust language does not support unsafe code at all?
"Unsafe" Rust is a superset because everything you can do in normal "safe" Rust, you can do within an "unsafe" block. That is, being within an "unsafe" block (which is what people mean by "Unsafe Rust") allows you to do more, not less.
> It seems to me that "unsafe" Rust is a subset of Rust as a whole.

Sure. When people say "Rust" they usually mean "safe Rust". But if we consider "Rust" as a whole, "Safe Rust", and "Unsafe Rust", then:

Rust is Unsafe Rust

Safe Rust is a subset of Unsafe Rust (and therefore Rust).
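A tiny sketch of the superset relationship (`demo` is just an illustrative name): everything that compiles outside an `unsafe` block also compiles inside one, and the block additionally permits a few extra operations such as dereferencing a raw pointer.

```rust
fn demo() -> (i32, i32) {
    let value = 41;
    let pointer = &value as *const i32; // creating a raw pointer is safe

    unsafe {
        // Ordinary safe code still works here, borrow checker included.
        let total: i32 = vec![1, 2, 3].iter().sum();
        // Dereferencing the raw pointer is what requires the `unsafe` block.
        // SAFETY: `pointer` comes from a live local variable.
        (total, *pointer + 1)
    }
}

fn main() {
    println!("{:?}", demo());
}
```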