In those cases I like coroutines / user-space threading. They give you the reduced cost of running on a single thread or a few threads, without the heavy toll of callbacks.
When you have 10,000 tasks and about 8 cores (give or take a few), the number of context switches is very large. Kernel switches happen mostly at the system call boundary of blocking IO, and each one requires the scheduler to decide which thread to wake up next and then change the running process.
This can be seen in the function context_switch in https://github.com/torvalds/linux/blob/master/kernel/sched/c... and that is without the arch-dependent components. It can hardly be compared in complexity and effort to swapping some 4 to 8 registers in user space.
The above still doesn't include any changes to the TLB and memory protection tables, as I assume the OS optimizes those away when it switches between two threads of the same program. I'm not sure that optimization normally happens.
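To make the user-space side concrete, here is a rough sketch of cooperative switching with plain Python generators. The names task and run_round_robin are made up for illustration, and a Python generator switch goes through the interpreter, so this is not the 4-to-8-register swap a native coroutine library would do. It only shows the control flow: 10,000 tasks take turns on one OS thread, and each switch is a function return plus a queue operation, with no kernel scheduler involved.

```python
from collections import deque


def task(task_id, steps, log):
    """Hypothetical task: yields wherever it would otherwise block on IO."""
    for step in range(steps):
        log.append((task_id, step))
        yield  # hand control back to the user-space scheduler


def run_round_robin(tasks):
    """Run generator-based tasks on one OS thread; return the switch count."""
    ready = deque(tasks)
    switches = 0
    while ready:
        current = ready.popleft()
        try:
            next(current)  # resume the task until its next yield
        except StopIteration:
            continue  # task finished, drop it from the queue
        ready.append(current)  # task yielded, so requeue it at the back
        switches += 1
    return switches


log = []
tasks = [task(i, 3, log) for i in range(10_000)]
switches = run_round_robin(tasks)
print(switches)
```

In a real system the yield points would sit where a non-blocking IO call reports that it would block, and the scheduler would requeue the task only once its file descriptor is ready.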