Hacker News
by pron 1799 days ago
There's no doubt forced preemption could help, but I'm still unsure about what the right algorithm is; probably not time sharing.

Suppose you have 100K threads, and only 1% of them become CPU-bound for 100ms. That could take down your 32-core server for 3 seconds, which is bad. But suppose we had 10ms time-slices. Then, those busy threads' latency might go from 100ms to as high as a few minutes, which means effectively taking them down. The scale has a qualitative effect here. So, rather than time-sharing, it might be better to optionally install some other preemption policy -- maybe something that indefinitely suspends threads that behave badly too often and puts them in some collection.
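
A back-of-the-envelope check of those figures, as a rough sketch rather than a model of any real scheduler. The class and method names are made up for illustration. The "few minutes" figure is reproduced here only under the assumption that all 100K threads are runnable and take turns round-robin; the comment does not spell that assumption out, so treat it as one possible reading.

```java
final class StallEstimate {
    // Wall-clock seconds during which every core is busy if `busyThreads`
    // threads each need `burstMillis` of CPU and run to completion unpreempted.
    static double stallSeconds(int busyThreads, double burstMillis, int cores) {
        return busyThreads * burstMillis / cores / 1000.0;
    }

    // Worst-case seconds for one thread to finish a `burstMillis` burst under
    // round-robin slicing, ASSUMING `runnableThreads` all compete for the cores.
    static double slicedLatencySeconds(int runnableThreads, double burstMillis,
                                       double sliceMillis, int cores) {
        double slicesNeeded = Math.ceil(burstMillis / sliceMillis);
        double roundMillis = runnableThreads * sliceMillis / cores;
        return slicesNeeded * roundMillis / 1000.0;
    }

    public static void main(String[] args) {
        int totalThreads = 100_000;
        int busyThreads = totalThreads / 100; // 1% of the threads go CPU-bound
        System.out.printf("no preemption, server stalled: %.3f s%n",
                stallSeconds(busyThreads, 100.0, 32));
        System.out.printf("10 ms slices, all threads runnable: %.1f s%n",
                slicedLatencySeconds(totalThreads, 100.0, 10.0, 32));
        System.out.printf("10 ms slices, only busy threads runnable: %.3f s%n",
                slicedLatencySeconds(busyThreads, 100.0, 10.0, 32));
    }
}
```

Under these assumptions the unpreempted stall comes to about 3.1 seconds, and the sliced latency is about 312 seconds (roughly five minutes) if all 100K threads compete, or about 3.1 seconds if only the 1,000 busy threads do.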

The point is that time-slicing will probably not be helpful in sufficiently many cases, and we don't yet know what will. We'd like to gather more data before offering something. In some other languages/runtimes it might be worthwhile to just expose a capability and see what people do with it, but with Java, within five minutes you'll have twenty libraries doing time-sharing, and thousands of people using them blindly whether it's good or bad for them (just because they say they do time-sharing, and that's good, no?), and now there's just noise and bad habits everywhere. This is nanny-state governance, but we've learned our lesson, and you can't be too careful with an ecosystem this big.

Sure. Ultimately you've got the same problem as an OS scheduler, and recognising whether threads are CPU-bound or IO-bound, and treating the two kinds separately, is probably going to be part of the solution.

I appreciate not wanting to do things until you can do them right, but equally, if you advertise this as a preemptive runtime, people are going to expect that they can use it to throw 32 CPU-spinning threads onto 8 cores and have it behave gracefully. It sounds like, from a user's point of view, on day 1 this runtime will be the worst of both worlds: you need to take care not to do big chunks of CPU work without yielding, but you don't get the full control that a traditional "userspace" cooperative multitasking framework would give you.

But the experience people have already been having with the Early Access is overwhelmingly positive. Even without forced preemption, "preemptive" is far less misleading than "cooperative", even considering the common confusion between preemptive scheduling and time-sharing.

While OS threads might indeed handle 32 spinning threads on an 8-core machine more gracefully, switching between thread implementations is easy, so such a "mistake" is inconsequential. No OS handles 320,000 spinning threads gracefully, and people know that this is the scale that virtual threads exist to serve.
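
To illustrate how small the switch between thread implementations can be, here is a minimal sketch. It assumes the API as finalized in JDK 21 (`Thread.ofVirtual()`, `Thread.ofPlatform()`, `Executors.newThreadPerTaskExecutor`), which is not necessarily what the Early Access builds discussed in this thread exposed. The class name, method names, and the boolean flag are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ThreadKindSwitch {
    // The only place that decides which thread implementation is used.
    static ThreadFactory factory(boolean useVirtual) {
        return useVirtual ? Thread.ofVirtual().factory()
                          : Thread.ofPlatform().factory();
    }

    // Runs `taskCount` trivial tasks, one thread per task, and returns how
    // many of them observed that they were running on a virtual thread.
    static int countTasksOnVirtualThreads(boolean useVirtual, int taskCount) {
        AtomicInteger onVirtual = new AtomicInteger();
        ExecutorService executor =
                Executors.newThreadPerTaskExecutor(factory(useVirtual));
        try {
            for (int i = 0; i < taskCount; i++) {
                executor.execute(() -> {
                    if (Thread.currentThread().isVirtual()) {
                        onVirtual.incrementAndGet();
                    }
                });
            }
        } finally {
            executor.shutdown();
            try {
                // Wait for the submitted tasks to finish before reading the count.
                executor.awaitTermination(1, TimeUnit.MINUTES);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return onVirtual.get();
    }

    public static void main(String[] args) {
        System.out.println("virtual:  " + countTasksOnVirtualThreads(true, 100));
        System.out.println("platform: " + countTasksOnVirtualThreads(false, 100));
    }
}
```

The task code is identical in both runs; only the factory changes, which is the sense in which picking the "wrong" implementation is cheap to undo.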

You're right that calling it "cooperative" would be worse. Still, I suspect Early Access users are paying a lot more attention to the details (and are more knowledgeable in general) than GA users will be. Switching thread implementations might be "easy", but I suspect most users will want to use Loom without tuning anything at all. So safe defaults are very important (and I'd suggest that for the default config, safely handling 32 spinning threads on 8 cores is more important than handling 320,000 mostly-sleeping IO-bound threads).

Which is not to say I have a better idea (other than "make the defaults magically do everything right", which is obviously hard).

I think we have the best defaults currently possible for the use-cases Loom targets, say, more than a few thousand concurrent tasks. The cases where you might observe some downside compared to the OS (before we choose to expose forced preemption) are not in that class. The only thing to consider is whether you have many concurrent tasks or a few, and if the answer is many, the choice is simple. Otherwise, you can experiment with different implementations, but the few-tasks case is not our initial focus.

Having said that, I'm interested in hearing about real-world cases (involving many tasks, not 32) where forced preemption, and possibly time sharing or maybe another strategy, can be useful. The "accidentally misbehaving subset" is a good example, but time-sharing probably isn't what we need to address it.