Going down the same rabbit hole earlier this week, found this to be a good explanation:
All of the data needed by a task is contained within its future. That means we can neatly sidestep problems of dynamic stack growth and stack swapping, giving us truly lightweight tasks without any runtime system implications. ... Perhaps surprisingly, the future within a task compiles down to a state machine, so that every time the task wakes up to continue polling, it continues execution from the current state—working just like hand-rolled code. [1]
> Perhaps surprisingly, the future within a task compiles down to a state machine, so that every time the task wakes up to continue polling, it continues execution from the current state
How are those tasks implemented, and what's scheduling them?
You need a third-party library to provide an executor. Rust does not come with one to keep the runtime size small. The community seems to have centralized on https://tokio.rs/ (under the hood it uses epoll/whatever OS-specific functionality to schedule M:N)
Cooperative scheduling is used to schedule tasks on executors. A single executor is expected to manage many tasks across a small set of threads. There will be a far greater number of tasks than threads. There also is no pre-emption. This means that when a task is scheduled to execute, it blocks the current thread until the poll function returns.
Because of this, it is important for implementations of poll to only execute for very short periods of time. For I/O bound applications, this usually happens automatically. However, if a task must run a longer computation, it should defer work to a blocking pool [1] or break up the computation into smaller chunks and yield back to the executor after each chunk. [2]
"poll" here refers to the callback function in the future that actually does the work.
When you want to do long-running blocking work (either computation, or synchronous IO from some library that doesn't use futures), you usually want to farm that work out to a thread pool. I think Tokio provides convenience functions for doing that.
It's also possible to write a long-running computation as a future, which can yield after some smaller amount of work to let some IO get done. But I'm not sure if that's the recommended approach.
Most languages allocate every future (and sub-future, and sub-sub-future) separately on the heap. This leads to some overhead, allocating and deallocating space to store our task state.
In Rust, you can "inline" an entire chain of futures into a single heap allocation.
ValueTask is more of an optimization for scenarios where the Future (or Task<T>) is often intended to be ready immediately (synchronously). For those cases its wasteful to first allocate a continuation on the heap, and then not to use it - because the code after the await can directly be executed.
The introduction of ValueTask allowed C# code to only perform dynamic allocations whenever the task definitely needed to be yielded - opposed to one allocation for every Task<T>. However it doesn't allow for guaranteed avoidance of allocations - like Rusts model can do.
However on the positive side removing the allocations for those cases is probably good enough since the other yields are of low-frequency (when waiting for IO). And Rust code currently requires allocations like Box<dyn Future> in order to make certain things work (like type-erasure and dynamically dispatched interfaces) that are not necessarily required in .NET in the non-yielded case.
From an ergonomic perspective I definitely prefer .NETs easy-by-default model which is still highly optimizable up to the point of avoiding all allocations. But I understand why this wouldn't have worked for Rust and it's constraints (no GC, no classes, no runtime).
So for example, in Kotlin each piece of synchronous code within an async function is compiled into what is essentially a Java Runnable object, which must be allocated on the heap.
As far as I understand in Kotlin continuations are only allocated on the heap when necessary (the suspending function can not be executed synchronously). Therefore most allocations should be avoided until blocking for IO.
I like to think of it from the other direction. I'm super excited to use futures and async/await in embedded hardware, where "the heap" might not even exist. Hardware interrupts can be treated as events that wake the executor. That lets me write some code like this:
This reads from the serial port and echos it back inside of `><`. The code reads very cleanly, does not require the heap, and should be very efficient.
The caveat is that I am also trying to target a platform that Lust / LLVM doesn't support completely yet, so I haven't gotten this to actually work well. I bet the people using the Cortex processor are far ahead of me though.
That's exactly what I am looking forward to, the state machine generation can make a lot of code targeting embedded platforms, especially the IO part, much more comfortable.
Might I ask which controller you are using? With Rust I'm still on a Cortex, but looking to apply it on other architectures in the future
A network read or write can be in the order of hundreds of nanoseconds when using an accelerated userspace networking stack. Allocation can be a good chunk of that.
All of the data needed by a task is contained within its future. That means we can neatly sidestep problems of dynamic stack growth and stack swapping, giving us truly lightweight tasks without any runtime system implications. ... Perhaps surprisingly, the future within a task compiles down to a state machine, so that every time the task wakes up to continue polling, it continues execution from the current state—working just like hand-rolled code. [1]
[1] https://aturon.github.io/blog/2016/09/07/futures-design/