Hacker News new | ask | show | jobs
by trias 2538 days ago
Honest question, maybe obvious: Why is the java-scheduler/ thread model so much better than the OS-level one? What does the java one do, or better what does it not do? And why can't the principles which make java-scheduling fast be applied to OS-threads?
7 comments

There are a couple of issues here: scheduling, and managing the stack. The reason a runtime can do better than the kernel is that its constraints are (hopefully) less constraining than those of the kernel.

In the case of managing the stack, the kernel must be able to support languages like C/C++ and Rust, that can have internal pointers pointing into the stack itself. This makes it hard to reallocate stacks dynamically and move them around in memory. The JVM doesn't allow such pointers in application code, and internal bookkeeping pointers are known.

As to the scheduler, the kernel must support very different kinds of threads: threads that serve server transactions that block very often, and threads that, say, encode a video, that rarely block at all. These different kinds of threads are best served by different schedulers (e.g. the "frequently blocking" kind of thread is best served by a work-stealing scheduler, but that may not be the best scheduler for other kinds of threads). When you implement fibers in the runtime, you can let the developer choose a scheduler for their needs.

That sounds like the runtime has more constraints thus it can make more assumptions, not less constraints, while the OS one must be more general.
It depends on your perspective.

The OS one has more _requirements_ (the number of "it must do..." is much higher), which constrains it's design space much more. I think you can call that as "having more constraints" because it must meet more needs to be even viable as a solution.

The number of requirements for JVM threads is much lower, thus less constraining, allowing it more freedom to implement solutions that can meet its narrower window of features.

Constraints on the user free the platform. Freedoms for the user constrain the platform.

If you are interested in someone pontificating about this at length, see https://youtu.be/GqmsQeSzMdw

pron, isn't it simpler to just allocate everything on heap, than reallocating coroutine stacks? Or is allocating on heap significantly less performant than managing coroutine stacks (stack need to be copied, while heap allocated object can stay the same).

Also, is there any timeline or estimate when Loom will be released with official JDK?

> Also, is there any timeline or estimate when Loom will be released with official JDK?

From another place in this thread:

As OpenJDK has switched to time-based releases, we no longer plan releases based on features. When a feature is ready, it is merged into the mainline and released in the following release. We never commit to a timeline, but I would say that the probability for fibers to land in one of the two releases next year as quite high. Of course, we release projects gradually, and it's possible that some planned fiber features will not land in the first release, or that the performance in the first release will later be improved etc. However, early access Loom binaries (with API still very much in flux) are expected in a month or so, with the intention of gathering feedback on the API. The decision on when fibers are ready to be merged will greatly depend on that feedback.

In the current Loom implementation, continuation stacks are allocated on the heap; they can grow and shrink as ArrayLists do -- the object stays the same, but the backing array needs to be reallocated.
I meant every stack frame to be a separate heap-allocated object.
There's a few different things to unpack in that question.

1. Fibers are more lightweight than threads because we only need to carry around a small amount of state representing the stack used so far by the fiber and a few other things. So it should be possible to have many more of them than we can have OS threads.

2. We can in theory make decisions about scheduling based on what the program is doing in ways that the OS cannot do. For example if one fiber releases a lock we might know to schedule the fiber waiting for that lock in preference to anything else, and would not have to wake all fibers waiting on that lock and let them race to acquire it as commonly happens with OS threads.

3. Some of these advantages can be done with OS threads when combined with user scheduling. This is available in Windows, but has not made it into mainline Linux yet. It allows the application to give the OS useful hints on which thread to schedule next, but it doesn't really reduce the weight of the threads themselves.

(I work for Oracle, and spend some of my time on project loom. These opinions are my own.)

Regarding point 2)

That is not how OS level threads would work. When a lock is released, the next ready-to-run task (blocked on that lock) will be made runnable. They wont all be released then 'race' to acquire the lock. The behaviour is identical in user-space or kernel-space.

That depends on the synchronization mechanism used, and whether or not that mechanism is hooked into the kernel scheduler. Sometimes it is (such as pthread_mutex) and sometimes it is not (such as rolling your own spin locks).
ok, any "sane" implementation would not use spin-locks for this purpose, but yes it is technically possible to do so.

Spin-locks are not really the mechanism of choice for this level of abstraction.

The OS thread model is lightweight processes - each OS thread is essentially a separate process sharing a common section of the process control block. Context switches are still relatively heavyweight involving a kernel mode switch, CPU state save and restore.

Application threads are a part of the application runtime environment. Thread scheduling is little more than a function call and doesn't involve a syscall invocation, and can be hooked into runtime environment trigger mechanisms.

Kernels have to be one size fits all, application threads can be tuned more precisely, with greater visibility over the application's state.

Seems like your saying application threads are “green threads”, when the jvm has used system threads one to one for a long time now. So for the jvm at least there’s no difference between a jvm thread an an OS thread. Some jvm impls may still use green threads, but the major ones haven’t for many years. Apologies if I’ve misunderstood.
OS threads use statically allocated stack. This limits the nubmer of threads one can have (say if stack is 1MB and you have 1GB of memory, you can have 1024 threads max). That's the main difference. I think some savings also come from avoiding full blown context switches.
You only lose address space as long as that stack memory isn't used, right?
Good point, yes. As mentioned in a sibling comment, even 4k for one page of really allocated memory is significant.
This is demonstrably not true. That address space isn’t allocated until it is used. Even with initial allocations of 4kB pages, there’s plenty more space than you’re estimating.
No you have it backwards - the address space is always allocated. The physical memory is not allocated (we say committed) until it is used.

But allocating the address space even if it is not committed still consumes finite resources and still limits the number of threads you can create.

Yeah, but there’s plenty of space to put 2MB stacks in 48 bits (that’s 256 TBs) of address space.
Again, there's plenty of address space in 48 bits, but you need to commit physical memory in order to allocate the space for the page table for that address space. And that's per-process, and it's going to thrash the TLB.
That's not how the OS's page management tables work. The OS assigns space for the stack at a (typically random) location in the process' address space, but no physical allocation is made. The very first time the stack is used a page fault exception is generated, causing a context switch to the OS. Only then does the memory management subsystem allocate a page for the stack and return control back to the program.

Handling memory allocation lazily like this is necessary to handle a number of edge cases, such as spinning up a massive number of short-lived threads. It also prevents thrashing of the TLB cache.

In practice, real operating systems finely tune their behavior here. I would not be surprised at all if a 4kB allocation is made for a thread's stack upon creation in modern operating systems. But I would be very surprised if, e.g., Linux allocated a full 1MB of memory at thread creation time instead of handling the vast majority of it lazily.

EDIT: Oh wait, I think you were mostly agreeing with me :) Yes, my original comment did mess up address allocation vs physical page allocation due to a brain fart. I meant it the other way around and I think we're saying nearly the same thing.

The one major point of difference is that to make an address allocation in the page table doesn't require a physical allocation. The OS can either leave that allocated space unconfigured, or assign it a protected page table. In either case it faults on access and the OS knows that before killing the process with core exception to first look in its internal lazy delayed-allocation tables to see if it the access was to an allocated area of the address space with deferred allocation.

Even 4k is a lot, and whatever memory is committed stays committed (and worse, it may even be paged). Perhaps it would be possible for the kernel to uncommit unused stack pages, but I don't think kernels want to assume threads never access memory below their stack pointer.
Oh I agree OS threads a really heavyweight and there is a real need for fibers / green threads. I think we're arguing over whether they are 100x more efficient or 10,000x more efficient. Details matter ;) Thanks for your work on this for Java!
I think you severely overestimate the efficiency gain both times.
The lightweight threads(fibers) are actually not threads on the OS level, they just behave similarly. As such there are no context switches etc., which are slow.
i know, but why doesn't the OS do this too? Most programmers don't care if their "thread" is an actual thread on the machine or just a "fiber". If the benefits are this large, it should be provided by the OS in my opinion.
Windows provides fibers in addition to threads. Apparently it’s a quite good API although not widely used:

https://nullprogram.com/blog/2019/03/28/

OS scheduling requires you to switch contexts between different modes in the CPU. Essentially, the OS has the power to perform operations that regular code does not. Threads and processes are usually built on top of these additional permissions.

Context switching is relatively expensive. The CPU needs to push a lot of application state out of the way to clear the path for the OS code to run, then after the OS is done, push the OS out of the way and retrieve the application's state and code again.

Whereas a fiber remains entirely inside the application code. It never requires a context switch. But a fiber loses some of the powers of the OS: for example, it can't draw hard memory boundaries between fibers that will be enforced by the CPU.

For some things you want processes, for some threads, for some fibers.

Context switching doesn't need to be as expensive as it is in these cases, it's just that Linux doesn't provide the mechanisms needed to make it more efficient. See, e.g., https://blog.linuxplumbersconf.org/2013/ocw/system/presentat... which implemented a syscall for a process-directed context switch directly to another thread.
If the OS were to do it, it would involve a switch to kernel-space and would then be termed an OS-level thread...

Another way of putting it is that yes, Operating Systems do do it. Thats what an OS-level thread is.

They do: it’s called threads.

It’s the OS context switch that makes process threads so much more costly than user-scheduled fibers. You cannot involve the OS if you want efficiency.

Otherwise it is the same exact thing.

> If the benefits are this large, it should be provided by the OS in my opinion.

Like with most things, there are drawbacks as well as benefits.

The OS doesn't know the details of what things are specific to a given "thread". So it has to take a "big dumb sledgehammer" approach to switching tasks: it has to switch out the whole stack (4k or 8k) even when there are only a few bytes that belong to that particular task (because it has no way of knowing which bytes those are), and it has to interrupt the tasks at essentially arbitrary times rather than waiting for them to yield (because, again, it has no way of knowing what the yield points are) which then means the tasks have to have more synchronization overhead etc. to work around the fact that they might be preempted at any point.

In this age of VMs/containers for everything I'm not convinced a conventional OS offers a lot of value - OSes made sense when programs needed to access different kinds of hardware and we liked to have multiple processes/users sharing a single machine while broadly trusting each other, but neither of those things is really true any more. Look at unikernels for where I think the future is going - bootable VMs that act as a language runtime that controls things like threading directly, no need for an OS intermediary.

> it has to switch out the whole stack (4k or 8k) even when there are only a few bytes that belong to that particular task (because it has no way of knowing which bytes those are)

This reads to me like you believe the OS must memcpy/move the whole stack out of the way on a context switch. It doesn't; the other thread has other, dedicated memory its stack, and on a context switch, the stack pointer is simply adjusted to point at the other stack.

Assuming Java's fibers work similar to other green thread implementations, green threads/fibers work similarly — it's just a pointer change, except the adjustment is done in userspace.

(And, to some degree, the OS does know what bytes are stack for any given task. It's whole pages, yes, but that allows the program to manage it otherwise. But I think most green-thread/userspace threads are similar: the stack is preallocated ahead of time and left to the thread to manage. Sure, you might not know the exact range, but you don't really need to? Go, I think, is an interesting outlier here; IIRC, it dynamically expands and contracts the allocated space on the stack in response to the application's demands, though I do think they had some interesting issues w/ loops thrashing allocations if they fell along an allocation boundary. I think they've also long since fixed that issue.)

Java's fibers actually work like the previous poster described, which may explain the confusion:

> The current prototype implements the mount/dismount operations by copying stack frames from the continuation stack – stored on the Java heap as two Java arrays, an Object array for the references on the stack and a primitive array for primitive values and metadata. Copying a frame from the thread stack (which we also call the vertical stack, or the v-stack) to the continuation stack (also, the horizontal stack, or the h-stack) is called freezing it, while copying a frame from the h-stack to the v-stack is called thawing. The prototype also optionally thaws just a small portion of the h-stack when mounting using an approach called lazy copy; see the JVMLS 2018 talk as well as the section on performance for more detail.

Source: https://wiki.openjdk.java.net/display/loom/Main

The video presentation describes it better. I think in their expected usage scenarios the stacks of fibers aren't particularly deep so the copying isn't that expensive. Also, IIRC, doing it this way was less intrusive to the existing JVM architecture; it's possible in time they'll rearchitect things to use a more traditional technique.

We don't have a JavaOS outside smartcards yet, which is basically a custom Java runtime with the IDT handlers in it. :) If that happens, we may get that missing super efficient implementation.
Simply: less stuff to do. Saving a few stack frames vs a transition into kernel space and subsequent hardware calls.

I imagine Java can do some party tricks with lock sharing too with fibers backed by the same OS thread.