Hacker News new | ask | show | jobs
by continuational 2502 days ago
I don't quite follow. What exactly is the overhead that other languages have for futures that is eliminated here?
3 comments

Going down the same rabbit hole earlier this week, found this to be a good explanation:

All of the data needed by a task is contained within its future. That means we can neatly sidestep problems of dynamic stack growth and stack swapping, giving us truly lightweight tasks without any runtime system implications. ... Perhaps surprisingly, the future within a task compiles down to a state machine, so that every time the task wakes up to continue polling, it continues execution from the current state—working just like hand-rolled code. [1]

[1] https://aturon.github.io/blog/2016/09/07/futures-design/

> Perhaps surprisingly, the future within a task compiles down to a state machine, so that every time the task wakes up to continue polling, it continues execution from the current state

How are those tasks implemented, and what's scheduling them?

You need a third-party library to provide an executor. Rust does not come with one to keep the runtime size small. The community seems to have centralized on https://tokio.rs/ (under the hood it uses epoll/whatever OS-specific functionality to schedule M:N)

See: https://news.ycombinator.com/item?id=20722297

How's that play out in terms of, say, performing some long-running calculations, rather than something that performs IO?
(Of course, this is Tokio-specific)

Cooperative scheduling is used to schedule tasks on executors. A single executor is expected to manage many tasks across a small set of threads. There will be a far greater number of tasks than threads. There also is no pre-emption. This means that when a task is scheduled to execute, it blocks the current thread until the poll function returns.

Because of this, it is important for implementations of poll to only execute for very short periods of time. For I/O bound applications, this usually happens automatically. However, if a task must run a longer computation, it should defer work to a blocking pool [1] or break up the computation into smaller chunks and yield back to the executor after each chunk. [2]

"poll" here refers to the callback function in the future that actually does the work.

[1] https://docs.rs/tokio-threadpool/0.1.15/tokio_threadpool/fn....

[2] https://tokio.rs/docs/internals/runtime-model/

When you want to do long-running blocking work (either computation, or synchronous IO from some library that doesn't use futures), you usually want to farm that work out to a thread pool. I think Tokio provides convenience functions for doing that.

It's also possible to write a long-running computation as a future, which can yield after some smaller amount of work to let some IO get done. But I'm not sure if that's the recommended approach.

You're essentially implementing statically scheduled cooperative userspace threads on top of a single thread with async/await.

So, for computation, this is a bad solution.

This is a great quote, and one that I missed while first writing the post! I've added it now.
Most languages allocate every future (and sub-future, and sub-sub-future) separately on the heap. This leads to some overhead, allocating and deallocating space to store our task state.

In Rust, you can "inline" an entire chain of futures into a single heap allocation.

In .NET something similar is possible via ValueTask.
ValueTask is more of an optimization for scenarios where the Future (or Task<T>) is often intended to be ready immediately (synchronously). For those cases its wasteful to first allocate a continuation on the heap, and then not to use it - because the code after the await can directly be executed.

The introduction of ValueTask allowed C# code to only perform dynamic allocations whenever the task definitely needed to be yielded - opposed to one allocation for every Task<T>. However it doesn't allow for guaranteed avoidance of allocations - like Rusts model can do.

However on the positive side removing the allocations for those cases is probably good enough since the other yields are of low-frequency (when waiting for IO). And Rust code currently requires allocations like Box<dyn Future> in order to make certain things work (like type-erasure and dynamically dispatched interfaces) that are not necessarily required in .NET in the non-yielded case.

From an ergonomic perspective I definitely prefer .NETs easy-by-default model which is still highly optimizable up to the point of avoiding all allocations. But I understand why this wouldn't have worked for Rust and it's constraints (no GC, no classes, no runtime).

Thanks for the thoughtful reply.

You are right, however there is a scenario that you have forgotten.

The possibility that the JIT might eventually optimise the code path given enough PGO data, and apply RVO or similar approaches.

Naturally I don't know if RyuJIT can already do this kind of optimization, given that only recently they have strengthened their focus on this area.

However this kind of optimizations are already doable in Graal, so it is possible that Microsoft also adds them to RyuJIT.

Which in any case would rule out some of the Rust deployment scenarios I guess.

Has anyone ever done a comparison to see how much overhead this actually adds? I'd be really curious to see this represented in concrete terms.
So for example, in Kotlin each piece of synchronous code within an async function is compiled into what is essentially a Java Runnable object, which must be allocated on the heap.
As far as I understand in Kotlin continuations are only allocated on the heap when necessary (the suspending function can not be executed synchronously). Therefore most allocations should be avoided until blocking for IO.
Ah. What kind of asynchronous task executes so fast that a heap allocation is measurable?
I like to think of it from the other direction. I'm super excited to use futures and async/await in embedded hardware, where "the heap" might not even exist. Hardware interrupts can be treated as events that wake the executor. That lets me write some code like this:

    async fn bracketed_echo() -> ! {
        loop {
            let mut buf = 0;
            Serial.rx(&mut buf).await;
    
            Serial.tx(b'>').await;
            Serial.tx(buf).await;
            Serial.tx(b'<').await;
        }
    }
This reads from the serial port and echos it back inside of `><`. The code reads very cleanly, does not require the heap, and should be very efficient.

The caveat is that I am also trying to target a platform that Lust / LLVM doesn't support completely yet, so I haven't gotten this to actually work well. I bet the people using the Cortex processor are far ahead of me though.

That's exactly what I am looking forward to, the state machine generation can make a lot of code targeting embedded platforms, especially the IO part, much more comfortable. Might I ask which controller you are using? With Rust I'm still on a Cortex, but looking to apply it on other architectures in the future
A network read or write can be in the order of hundreds of nanoseconds when using an accelerated userspace networking stack. Allocation can be a good chunk of that.