Hacker News new | ask | show | jobs
by pron 2425 days ago
> Can we identify why exactly 1:1 threading is uncompetitive with async/await schemes, and fix that?

I think we can identify it, but fixing it is not easy, at least at the kernel level.

The easy part is the cost of scheduling. The kernel must schedule threads with very different behaviors -- say, encoding a video or serving requests over a socket -- so it uses a scheduling algorithm that's a compromise for all uses. But we can let the language's runtime take the scheduling of kernel threads with something like this: https://youtu.be/KXuZi9aeGTw

The harder part is managing the memory required for the stack. Even on 64-bit systems, the OS can't shrink and grow stacks finely enough. For example, I don't think there's a hard requirement that, say, memory below the sp is never accessed, so the OS can't even be sure about how much of the stack is used and can't uncommit pages. But even if there were such a requirement, or, that the language could tell the OS it follows such a requirement, still the OS can only manage memory at a page granularity, which is too much for lightweight concurrency. Any finer than that requires knowing about all pointers into the stack.

We can do it in languages that track the location of all pointers in all frames, though, which is what we're attempting to do in Java, and this allows us to move stacks around, grow them and shrink them as necessary, even at a word granularity.

> All this, by the way, is not to say that Rust made the wrong decision in focusing on async/await. Rust generally follows the philosophy of "if the computing environment is difficult, it's Rust that has to adapt".

... and it follows C++'s (horribly named) "zero-cost abstractions" philosophy which can reasonably be said to be a requirement of Rust's target domains. So I certainly don't think async/await is wrong for C++/Rust/Zig, but I think it's wrong for, say, JavaScript and C#, and we're going a different way in Java (more like Scheme's).

Another possible contributing factor is that in one respect Rust is a higher-level language than Java or JS: it compiles to a VM -- LLVM/WASM -- over which it doesn't have full control, so it doesn't have complete control over its backend. That, BTW, is why Kotlin adopted something similar to async/await.

2 comments

> at a page granularity, which is too much for lightweight concurrency

I don't think that's been conclusively shown. 4kB is smaller than a lot of single stack frames. It would be interesting for someone to measure how large e.g. Go stacks are in practice—not during microbenchmarks.

> Another possible contributing factor is that perhaps ironically, in one respect Rust is a higher-level language than Java or JS, as it compiles to a VM -- LLVM/WASM -- over which it doesn't have full control, so it doesn't have complete control over its backend.

It's not really a question of "control"; we can and do land changes upstream in LLVM (though, admittedly, they sometimes get stuck in review black holes, like the noalias stuff did). The issue is more that LLVM is very large, monolithic, and hard to change. Upstream global ISel is years overdue, for example. That's one of the reasons we have Cranelift: it is much smaller and more flexible.

But in any case, the code generator isn't the main issue here. If GC metadata were a major priority, it could be done in Cranelift or with Azul's LLVM GC support. The bigger issue is that being able to relocate pointers into the stack may not even be possible. Certainly it seems incompatible with unsafe code, and even without unsafe code it may not be feasible due to pointer-to-integer casts and so forth. Never say never, but relocatable stacks in Rust seems very hard.

The upshot is that making threading in Rust competitive in performance to async I/O for heavy workloads would have involved a tremendous amount of work in the Linux kernel, LLVM, and language design—all for an uncertain payoff. It might have turned out that even after all that work, async/await was still faster. After all, even if you make stack growth fast, it's hard to compete with a system that has no stack growth at all! Ultimately, the choice was clear.

> I don't think that's been conclusively shown.

That would be interesting to study. We'll do it for Java pretty soon, I guess.

> After all, even if you make stack growth fast, it's hard to compete with a system that has no stack growth at all!

If the same Rust workload were to run on such a system there wouldn't be stack growth, either. The stack would stay at whatever size Rust now uses to store the async fn's state.

> a tremendous amount of work in the Linux kernel

I don't think any work in the kernel would have been required.

> Ultimately, the choice was clear.

I agree the choice is pretty clear for the "zero-cost abstractions" approach, and that Rust should follow it given its target domains. But for domains where the "zero-cost use" approach makes more sense, the increase in accidental complexity is probably not worth some gain in worst-case latency. But that's always the tradeoff the two approaches make.

You suggested the switchto patch, right? That requires Linux kernel work.
Assuming you know how to relocate stacks, you could implement efficient delimited continuations in Rust without the kernel's involvement, and use whatever it is you use for async/await scheduling today (so I'm not talking about some built-in runtime scheduler like Go's or Erlang's). But this would require changes in LLVM, and also a somewhat increased footprint (although the footprint could be addressed separately, also by changes in LLVM). The kernel work is only required if you want to suspend computations with non-Rust frames, which you can't do today, either.
That sounds like mioco/old M:N. That was slower than our current 1:1 threading.
I don't know what the old M:N model did, but a compiler can generate essentially the same code as it does for async/await without ever specifying "async" or "await." The reason for that is that the compiler already compiles any subroutine into a state machine (because that's what a subroutine is) in a form you need for suspension. But you do need to know what that form is, hence the required change in LLVM.

There is no inherent reason why it would be any slower or faster. It also does not imply any particular scheduling mechanism.

> So I certainly don't think async/await is wrong for C++/Rust/Zig, but I think it's wrong for, say, JavaScript and C#, and we're going a different way in Java (more like Scheme's).

How can it be wrong for JavaScript? Because JavaScript is a single threaded environment, it was the only possible choice and it's one of the most influential change in the language (in itself it changed more the language use than the whole ES6 bundle). It's a really great success and I'm not sure why you'd like to revert it.

> I'm not sure why you'd like to revert it.

I don't. The old way and async/await aren't the only two options.

> it was the only possible choice

Why?

What would you do instead? Use a green thread system on top of a single-threaded VM? Great, now you have two kinds of blocking calls: the one which only block its own thread, and the one which block all threads at the same time because it blocks the VM itself. How ergonomic!

Remember, the single threaded character of the js VM is not an implementation detail, it's part of the spec. Hate it or love it but the web works this way.

Also, if you think threads are a good concurrency abstraction, let's play a little game: consider you need to read 1M files on a spinning disk. How many threads do you need to run on to get the maximum read performance:

a) one per file

b) just one

c) one per core

d) a magic number which depends on your hard disk's firmware and your workload.

Convenient and intuitive right?

Threads are a dated concurrency primitive which would have died long ago if wasn't also a good parallelism primitive.

> Great, now you have two kinds of blocking calls: the one which only block its own thread, and the one which block all threads at the same time because it blocks the VM itself. How ergonomic!

You could only have the first kind, but it's the same situation with async/await. Only async/await tracks the "good" kind of blocking, yet lets the "bad" kind go untracked.

> How many threads do you need to run on yo get the maximum read performance

The exact same number as you would for doing it with `await Promise.all` The same knowledge you have about the scheduling mechanism doesn't go away if you're no longer required to annotate functions with `async`.

> Threads are a dated concurrency primitive which would have died long ago if wasn't also a good parallelism primitive.

Maybe they are, but async/await are the exact same construct only that you have to annotate every blocking function with "async" and every blocking call with "await". If you had a language with threads but no async/await that had that requirement you would not have been able to tell the difference between it and one that has async/await.

> Only async/await tracks the "good" kind of blocking, yet lets the "bad" kind go untracked.

The “bad kind” is indistinguishable from CPU intensive computation anyway (which cannot be tracked), but at least you have a guarantee when you are using the good kind. (Unfortunately, in JavaScript, promises are run as soon as you spawn them, so they can still contain a CPU heavy task that will block your event loop, Rust made the right call by not doing anything until the future is polled).

> The exact same number as you would for doing it with `await Promise.all`

From a user's perspective, when I'm using promises, I have no idea how it's run behind (at it can be nonblocking all the way down if you are using a kernel that supports nonblocking file IOs). This example was specifically about OS threads though, not about green ones (but it will still be less expensive to spawn 1M futures than 1M stackful coroutines).

> Maybe they are, but async/await are the exact same construct only that you have to annotate every blocking function with "async" and every blocking call with "await". If you had a language with threads but no async/await that had that requirement you would not have been able to tell the difference between it and one that has async/awaitof

I don't really understand your point. Async/await is syntax sugar on top of futures/promises, which itself is a concurrency tool on top of nonblocking syscalls. Of course you could add the same sugar on top of OS threads (this is even a classic exercise for people learning how the Future system works in Rust), that wouldn't make much sense to use such thing in practice though.

The question is whether the (green) threading model is a better abstraction on top of nonblocking syscalls than async/await is. For JavaScript the answer is obviously no, because all you have behind is a single threaded VM, so you lose the only winning point of green threading: the ability to use the same paradigm for concurrency and parallelism. In all other regards (performance, complexity from the user's perspective, from an implementation perspective, etc.) async/await is just a better option.

> which cannot be tracked

Of course it can be tracked. It's all a matter of choice, and things you've grown used to vs. not.

> but at least you have a guarantee when you are using the good kind

Guarantee of what? If you're talking about a guarantee that the event loop's kernel thread is never blocked, then there's another way of guaranteeing that: simply making sure that all IO calls use your concurrency mechanism. As no annotations are needed, it's a backward-compatible change. That's what we're trying to do in Java.

> but it will still be less expensive to spawn 1M futures than 1M stackful coroutines.

It would be exactly as expensive. The JS runtime could produce the exact same code as it does for async/await now without requiring async/await annotations.

> Async/await is syntax sugar on top of futures/promises, which itself is a concurrency tool on top of nonblocking syscalls.

You can say the exact same thing about threads (if you don't couple them with a particular implementation by the kernel), or, more precisely, delimited continuations, which are threads minus the scheduler. You've just grown accustomed to thinking about a particular implementation of threads.

> The question is whether the (green) threading model is a better abstraction on top of nonblocking syscalls than async/await is

That's not the question because both are the same abstraction: subroutines that block waiting for something, and then are resumed when that task completes. The question is whether you should make marking blocking methods and calls mandatory.

> In all other regards (performance, complexity from the user's perspective, from an implementation perspective, etc.) async/await is just a better option.

The only thing async/await does is force you to annotate blocking methods and calls. For better or worse, it has no other impact. A clear virtue of the approach is that it's the easiest for the language implementors to do, because if you have those annotations, you can do the entire implementation in the frontend; if you don't want the annotation, the implementors need to work harder.

Of course, you could argue that you personally like the annotation requirement and that you think forcing the programmer to annotate methods and calls that do something that is really indistinguishable from other things is somehow less "complex" than not, but I would argue the opposite.

I have been programming for about thirty years now, and have written extensively about the mathematical semantics of computer programs (https://pron.github.io/). I understand why a language like Haskell needs an IO type (although there are alternatives there as well), because that's one way to introduce nondeterminism to an otherwise deterministic model (I discuss that issue, as well as an alternative -- linear types -- plus async/await and continuations here: https://youtu.be/9vupFNsND6o). And yet, no one can give me an explanation as to why one subroutine that reads from a socket does not require an `async` while another one does even though they both have the exact same semantics (and the programming model is nondeterministic anyway). The only explanation invariably boils down to a certain implementation detail.

That is why I find the claim that even when two subroutines have the same program semantics, and yet the fact that they differ in an underlying implementation detail means that they should have a different syntactic representation, is somehow less complex than having a single syntactic representation to be very tenuous. Surfacing implementation details to the syntax level is the very opposite of abstraction and the very essence of accidental complexity.

Now, I don't know JS well, and there could be some backward compatibility arguments (e.g. having to do with promises maybe), but that's a very different claim from "it's less complex", which I can see no justification for.