Hacker News new | ask | show | jobs
by oldgradstudent 885 days ago
Java virtual threads did not cause a deadlock here.

The deadlock was a usage error.

A better title would be: Naively switching to Java virtual threads caused a deadlock in TPC-C for Progress SQL.

5 comments

Submitted title was "Java virtual threads caused a deadlock in TPC-C for PostgreSQL". We've reverted it now to the article's own title (truncated to fit HN's 80 char limit).

"Please use the original title, unless it is misleading or linkbait; don't editorialize." - https://news.ycombinator.com/newsguidelines.html

I would like to reply with a quote of sir Tony Hoare's 1980 ACM Turing Award Lecture: "There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies".
This one was a very large caveat pointed out loudly every time the feature was mentioned to the community. So, yes the limitation/flaw was very well known and is ideally going to be addressed in a future JDK release.
> This one was a very large caveat pointed out loudly every time the feature was mentioned to the community.

It really wasn't. There were people on here, including Oracle employees, claiming that the virtual thread implementation was a drop-in replacement that would work (not necessarily perform better, but work) in all cases.

And it indeed does in common usage scenarios. And also in this case once the issue with `synchronized` is resolved. After all, this is a benchmark and it's not surprising that one of the limitations of the design was hit.
"common usage scenarios" != "in all cases".

I don't know if those Oracle employees actually did outright say -- or even imply -- "in all cases" as the GP asserted, but if they did, then "only" working in "common usage scenarios" would definitely be overselling the feature.

I find it unlikely as well that they said it would just work in all cases. But since it's going to work out eventually and a workaround exists, they would actually not be that wrong with that statement.
If your program was prone to deadlock as is, and is just more easily happening with virt threads, it means that the problem is your code.
If your program requires 5 OS threads to be able to make progress (say, there are operations for which 4 threads are blocked while the 5th does work), and the runtime only spawns 2 OS threads and tries to schedule your 5 now virtual threads on those 2 OS threads, then your old program was perfectly correct and the virtual threads runtime has broken it.

This was a known and advertised failure case with the new threads runtime, exacerbated by limitations in the implementation that cause certain blocking operations to block the current OS thread instead of blocking the virtual thread and allowing another virtual thread to be re-use the underlying OS thread.

Having X resources and Y dedicated threads that operate on them, where Y > X, and allowing a thread to block in a way that requires the assistance of another thread to make progress but only when it's holding a resource, is a perfectly reasonable, standard, and safe design. When a change to the runtime silently reduces the number of synchronized sections a program can enter concurrently, it's not at all surprising that this breaks working code.
Concurrency plus parallelism is complicated. There is no going around that.
Article won’t load for me. My understanding was that the switch to virtual was supposed to be relatively simple, allowing the programmer to be naive. But a user deadlock is a user deadlock, no matter the threading impl.
The issue here is currently virtual threads don't work well with the 'sychronized' keyword. Right now synchronized will pin the carrier thread. The fix was to switch to a higher-level abstraction that works with virtual threads.

My understanding is there is work to make synchronized not pin the carrier thread, but that's some pretty complex and important code to change.

From a relatively brief skim and past Go and Java experience: synchronized blocks the current normal thread, so that doesn't really seem any different to me. If you starve your threads, you starve your threads.

It definitely leaves room to optimize by not pinning that thread, which would be great, but that shouldn't change semantics at all. Or is there something actually screwed up in the implementation of virtual threads that makes this a much bigger issue?

It’s a thread that supports n virtual threads. You want synchronized in virtual thread a and not the carrier thread which will block all the virtual threads.

Been away from Java land for a while. How did something like that even get into release? That’s like a pretty big loaded shotgun to leave lying around with lots of kids playing, no?

It's an explicitly documented shortcoming of the existing implementation that will be fixed soon. I knew immediately from the title of TA what probably happened. The other similar limitations (CPU-bound tasks, native calls) seem much more severe, but are ultimately unsolvable. Meanwhile, the issue with synchronized is regarded as a scalability bottleneck since the JDK is supposed to temporarily spawn additional platform threads. This behavior can be controlled via the system property `jdk.virtualThreadScheduler.maxPoolSize`.

Also, this is a benchmark. It's not surprising that they managed to produce a situation where more than n_cores virtual threads would actually start waiting.

Appreciate the reply. Hope you get it out soon since many /g do not read documentation and synchronized semantics changing is a ‘surprise’, specially since this one of those bugs that is a nursery for heisenbugs.
It spawns a new carrier thread in place, up to a certain, configurable limit. But starving carrier threads will also result in effectively live locks, so that’s not a solution.

So I don’t see the big fuss about it - don’t spawn a million virtual thread that all just spams synchronized?

I agree. Seems like a huge Java design error.
Java's virtual threads are supposed to be a drop-in replacement for real threads. But using virtual threads means you get a far smaller number of real threads, and things that were safe back when you had an unlimited number of real threads available (or at least, a larger number than your database connection pool) are no longer safe.
Shouldn't you be able to use the same number of real threads though, plus some additional effectively-threads for the virtual threads that are not pinned? Doesn't seem like this should change semantics there, so the risk would be code that changes because of perceived advantages which are not true in edge cases - that's new behavior that wasn't possible before, there aren't really any existing semantics to break.

If they're, like, limiting to CPU cores * 2 threads: yeah that would be Bad™. Unambiguously. I haven't been able to find anything conclusive about this though.

That sort of what happens, there is just a configurable hard limit on how much new thread may be created that was hit by this benchmark.

As mentioned in another comment: jdk.virtualThreadScheduler.maxPoolSize

In this case the synchronized blocks are released by calls to Object.wait, so that code would not deadlock with normal threads.

The issue is that Object.wait doesn't suspend virtual threads, so you get deadlocks. The answer is to reimplement usages of the wait/notify pattern to use locks or concurrent collections (for example, using a concurrent message queue for the producer/consumer pattern, which is a common use case for synchronized/wait/notify).

The adoption guide tells you not to switch to virtual threads just for their own sake. They're not meant as a straight replacement for OS threads.

https://docs.oracle.com/en/java/javase/21/core/virtual-threa...

No mentions of deadlocks. Just this:

> Pinning does not make an application incorrect, but it might hinder its scalability.

The documentation is wrong.

Yes, it is. That’s because their discussion of pinning is incomplete: it needs to mention forward progress.
The workaround is to increase the carrier thread pool size.
This has nothing to do with postgres though. It’s part of a generic JDBC connection pool:

> The problem is that this synchronized code might be deeply embedded within the libraries you use. In our case, it was within the c3p0 library. So, the fix is straightforward: we simply wrapped the connection with a java.util.concurrent.Semaphore.

I bet if you just checked out connections and slept a random amount of time you’d have the same problem.

What was the user error? Was there something obvious they did or didn't do or is it a "you're holding it wrong" kind of issue?
c3p0 appears to have been working under the assumption that their threads can always run when whatever they’re waiting on is done. Their only dependency is the OS giving them cycles.

Virtual threads changed the contract a little bit. Now one virtual thread running certain code can prevent a different virtual thread from ever getting any cycles even though they are not dependent on each other in the Java code. It’s a side effect of the current Java implementation.

The rules changed, and it tripped up c3p0. Unless they explicitly said somewhere that they were completely ready for virtual threads I’m not sure anyone is at fault here.

It is hard to be confident whether the entire dependency tree is free of this issue.