Hacker News new | ask | show | jobs
by packetlost 695 days ago
Yeah, and it's generally good to be RAM limited instead of CPU, no? The alternative is blowing a bunch of time on syscalls and OS scheduler overhead.

Also the virtual threads run on a "traditional" thread pool to my understanding, so you can just tweak the number of worker threads to cap the total concurrency.

The benefit is it's overall more efficient (in the general case) and lets you write linear blocking code (as opposed to function coloring). You don't have to use it, but it's nice that it's there. Now hopefully Valhalla actually makes it in eventually

1 comments

The OS scheduler is still there (for the carrier threads), but now you've added on top of that FJ pool based scheduler overhead. Although virtual threads don't have the syscall overhead when they block, there's a new cost caused by allocating the internal continuation object, and copying state into it. This puts more pressure on the garbage collector. Context switching cost due to CPU cache thrashing doesn't go away regardless of which type of thread you're using.

I've not yet seen a study that shows that virtual threads offer a huge benefit. The Open Liberty study suggests that they're worse than the existing platform threads.

> The OS scheduler is still there (for the carrier threads), but now you've added on top of that FJ pool based scheduler overhead.

Ideally carrier threads would be pinned to isolated cpu cores, which removes most aspects of OS scheduler from the picture

> I've not yet seen a study that shows that virtual threads offer a huge benefit.

Not exactly Java virtual threads, but a study on how userland threads beat kernel threads.

https://cs.uwaterloo.ca/~mkarsten/papers/sigmetrics2020.html

For quick results, check figures 11 and 15 from the (preprint) paper. Userland threads ("fred") have ~50% higher throughput while having orders of magnitude better latency at high load levels, in a real-world application (memcached).

The study says there's surprising performance problems with Java's virtual thread implementation. Their test of throughput was also hilarious, they put 2000 OS threads vs 2000 virtual threads: most of the time OS threads don't start falling apart until 100k+ threads. You can architect an application such that you can handle 200k simultaneous connections using platform-thread-per-core, but it's harder to reason about than the linear, blocking code that virtual threads and async allow for.

> Context switching cost due to CPU cache thrashing doesn't go away regardless of which type of thread you're using.

Except it's not a context switch? You're jumping to another instruction in the program, one that should be very predictable. You might lose your cache, but it will depend on a ton of factors.

> there's a new cost caused by allocating the internal continuation object, and copying state into it.

This is more of a problem with the implementation (not every virtual thread language does it this way), but yeah this is more overhead on the application. I assume there's improvements that can be made to ease GC pressure, like using object pools.

Usually virtual threads are a memory vs CPU tradeoff that you typically use in massively concurrent IO-bound applications. Total throughput should take over platform threads with hundreds of thousands of connections, but below that probably perform worse, I'm not that surprised by that.

> Except it's not a context switch? You're jumping to another instruction in the program, one that should be very predictable. You might lose your cache, but it will depend on a ton of factors.

Java virtual threads are stackful; they have to save and restore the stack every time they mount a different virtual thread to the platform thread. They do this by naive[0] copying of the stack out to a heap allocation and then back again, every time. That's clearly a context switch that you're paying for; it's just not in the kernel. I believe this is what the person you're replying to is talking about.

[0] Not totally naive. They do take some effort to copy only subsets of the stack if they can get away with it. But it's still all done by copies. I don't know enough to understand why they need to copy and can't just swap stack pointers. I think it's related to the need to dynamically grow the stack when the thread is active vs. having a fixed size heap allocation to store the stack copy.