Hacker News new | ask | show | jobs
by omazurov 1770 days ago
>... delimited continuations (of some form)

I expect internal implementation of continuations in Loom to have non-negligible overhead which may be justified for heavy blocking I/O operations but not for CPU-bound workloads. One benchmark I'd like to see (I may want to create it myself at some point) would run a large number of light threads (fibers) which would exchange data via blocking queues (to mimic goroutines with channels). No I/O, a totally CPU-bound workload with a lot of concurrency and a good potential for scalability. Somehow, I'm not very optimistic about Loom's performance in that context (please tell me if and where I'm wrong).

3 comments

1. CPU-bound problems that are not externally driven by I/O events fall under the purview of parallelism (shortening the latency of performing a single task by splitting it into cooperating tasks that employ multiple processing units) rather than concurrency (increasing the throughput of handling multiple competing, largely independent events, using scheduling). In Java, parallel streams and the low-level ForkJoinTask were made to address parallelism, whereas Loom's virtual threads are designed to tackle concurrency. It is true that some parallel use-cases are more conveniently expressed with continuations, but more on that below.

2. For CPU-bound workloads, virtually any overhead that isn't zero can come to dominate in many situations. The overhead can be made to be zero when the tasks are known in advance and can be inlined. This happens in two situations: generators, where there is one producer that can be inlined together with the consumer, and parallel workloads where all the tasks are the same. I am not aware of any single implementation, or even a single user-facing construct, that gives you the optimal experience both in those situations and in I/O-driven concurrency. If you have only one continuation construct, you must choose whether you want to favour one or the other. C++ favours the former, while Java, like Go, the latter, partly because Java already has a decent parallelism construct, and partly because we believe that the concurrent use-case delivers more value to more people. If parallel use-cases that benefit from continuations and/or generators come in high demand, we can address them later.

For a CPU bound workload you’ll be better off using one «full» thread per cpu in a pool and submitting tasks to said pool.

Then again, the same would be true in Go. Loom simply brings goroutines to the JVM.

Go might perform better, but that can be due to many other things not specific to the concurrency implementation (like better memory utilization because not every struct is stored on the heap in Go)

I thought the promise was we'd use Threads for everything and forget to worry about blocking calls. I can't run one OS thread per CPU in my proposed benchmark: I want to run millions of threads. Reimplementing it in a different concurrency model would mean Loom is hopeless for such workloads. Go may perform better but I don't think it would perform well (I see, I now have to implement my benchmark in go as well). I just don't believe their scheduler is so good.
> I thought the promise was we'd use Threads for everything and forget to worry about blocking calls.

That is the promise, yes, but you specifically mentioned a CPU-bound task. A CPU-bound workload cannot hope to do much better than doing a concurrent task spread equally among the available CPU's, spawning more threads will just add coordination overhead.

If you're spawning one million threads, each of which will perform a blocking operation, then you're likely not CPU bound (unless you're doing a System.sleep(1) or the equivalent).

> I just don't believe their scheduler is so good.

I believe you pick your own scheduler. By default it uses ForkJoin, which is supposed to be pretty good, uses the same conceptual implementation as Go (work stealing), but you're free to use something else.

> If you're spawning one million threads, each of which will perform a blocking operation, then you're likely not CPU bound

Imagine a DAG with million nodes. Each node takes data from all its input edges, processes it and sends to all its output edges. Now that I have light threads I want to use them to implement all nodes. Edges are naturally implemented as blocking queues. If I feed enough data into this structure I get a CPU-bound workload with a lot of concurrency (and scalability). Yet the nodes will have to block on their input queues because data will not be ready in most cases.

Now, to complicate the design, I want to have I/O tasks that read data from a file/network and feed it into the DAG and send its resulting output to the UI thread. Should I use different "threads" for different parts of the design? I'd hate to, especially given the fact that all those parts are dynamic in nature.

So, of course, I'm not currently spawning a million threads to execute all those tasks but Loom says I could do that for all my tasks and my concern is that to get decent performance I'd have to explicitly separate my "CPU-bound" tasks from my "I/O-bound" tasks and use different kinds of threads for each (which I'm kind of doing right now but again the promise was...).

In that case you should be fine. It should work as well as goroutines and async/await in other languages.

As long as you do IO or other blocking operations which block for a significant amount of time (ms instead of ns), Loom will be beneficial. Otherwise, you'd do better with regular threads in a pool, but even then I'd imagine the performance difference to be tiny.

It's not answering your question (indeed what you've described is actually very similar to my entity simulation example), but it sounds like the actor model might be a good fit, FWIW.
What's the use case for that kind of behaviour? AIUI the main point of Loom is so that we can go back to writing in (logical) thread-per-request style for IO-bound network servers. If you've got a CPU-bound workload then it's presumably not interactive so writing it in high-throughput high-latency batch style would be more natural, so it's hard to think why you'd want millions of threads for a CPU-bound workload. (I guess maybe some kind of entity simulation thing where you want to write one thread per entity? But I'm not sure you'd expect a huge speedup from that versus just doing the whole thing single-threaded).
Please see my example below.