by saman_b 3518 days ago
Hi, I developed uThreads. I looked at lthreads quickly, and it seems lthreads only maps multiple coroutines onto a single pthread (N:1). It does allow running multiple pthreads, but each pthread can only run its own local lthreads (M threads that each do an N:1 mapping). In uThreads, by contrast, uThreads can be multiplexed over multiple pthreads (thus an M:N mapping). Also, the lthreads scheduler is based on a per-pthread epoll/kqueue, whereas uThreads uses run queues to manage uThreads, which has less overhead. A per-pthread epoll/kqueue can mean better scalability for a large number of threads compared with uThreads, which relies on a single poller thread. But since the poller thread and synchronization have very low overhead in uThreads, scalability is not an issue (experiments with up to 16 threads show that uThreads scales very well). lthreads provides compute boundaries and async IO to move lthreads onto other pthreads, but this process seems to be very expensive. uThreads does not provide these features, but it gives the developer more flexibility and control by providing migrations. Developers can use migration at any point to move a uThread to another set of kThreads to execute tasks asynchronously (by defining Clusters of kThreads, e.g., an IO Cluster or a Compute Cluster).
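To make the M:N idea concrete, here is a minimal sketch of N tasks multiplexed over M kernel threads through one shared run queue. It is not the uThreads API: `SharedRunQueue`, `run_m_to_n`, and `Task` are hypothetical names, and plain closures stand in for user-level threads (real ones keep their own stacks and can yield mid-function, which closures cannot).

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a user-level thread: a closure run to completion.
using Task = std::function<void()>;

// One run queue shared by all worker (kernel) threads.
class SharedRunQueue {
public:
    void push(Task t) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push_back(std::move(t));
        }
        cv_.notify_one();
    }

    // Blocks until a task is available; returns false once the queue
    // has been closed and fully drained.
    bool pop(Task& out) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return closed_ || !q_.empty(); });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop_front();
        return true;
    }

    void close() {
        {
            std::lock_guard<std::mutex> lock(m_);
            closed_ = true;
        }
        cv_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<Task> q_;
    bool closed_ = false;
};

// Runs n_tasks tasks on m_workers kernel threads and returns how many ran.
// Any worker may pick up any task, which is what makes the mapping M:N.
std::size_t run_m_to_n(std::size_t n_tasks, std::size_t m_workers) {
    assert(m_workers > 0);
    SharedRunQueue rq;
    std::atomic<std::size_t> done{0};

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < m_workers; ++i) {
        workers.emplace_back([&rq] {
            Task t;
            while (rq.pop(t)) t();
        });
    }

    for (std::size_t i = 0; i < n_tasks; ++i) {
        rq.push([&done] { done.fetch_add(1); });
    }

    rq.close();
    for (auto& w : workers) w.join();
    return done.load();
}
```

Calling `run_m_to_n(1000, 4)` runs 1000 tasks on 4 kernel threads. In the N:1-per-pthread model described above, each worker would instead own a private queue and tasks would never cross workers.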
2 comments

I recommend trying to get access to larger machines with more hardware parallelism. I have seen techniques that scale just fine to using 16 threads, but hit serious limitations when you get to over 100 threads.
You are right. I have access to machines with more cores, but they have multiple sockets, and at some point I would need to address the cross-NUMA cost, which adds a whole new level of complexity and design decisions.

For sure, at some point the poller thread will be saturated and the program will not scale past a certain number of threads. I used to have a poller thread per cluster for better scalability, but that added overhead for migrations between clusters, so I had to remove it until I can find a low-overhead solution. uThreads is a work in progress, and all of this needs to be carefully considered in the future :) Thanks for your feedback.

Sometimes, the techniques you use to scale to 100s of threads solve some NUMA issues, because in order to scale that high you need to avoid touching non-local data as much as possible. I think it's better to just deal with the pain now and start running your experiments on as large a machine as you can get access to. You can still put off explicitly designing for NUMA, but you want to avoid spending too much time and effort designing for the lower end of the scalability spectrum.
> uThreads is using run Queues

This sounds like the Linux kernel. I'm curious to understand why copying this logic into user space is worthwhile.

I am not sure whether you are referring to the run queue as used in Linux or to the whole approach. I'll try to answer both:

Run queues can be part of any scheduler; they are simply queues of runnable tasks.

As for why the approach is worthwhile in user space, it has to do with the low cost of operations (context switches) in user space, and with using cooperative scheduling instead of preemptive scheduling. Cooperative scheduling gives the user more control over the tasks, and it also has lower overhead since there is no need to manage a quantum for each task (thread).
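As a rough illustration of a cooperative run queue (not uThreads code; `CoopTask` and `run_cooperative` are hypothetical names), the sketch below models each task as a number of slices. A task runs one slice, then voluntarily goes to the back of the queue. No timer or quantum is involved: the scheduler only acts when a task yields or finishes.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cooperative task: `remaining` is the number of slices it
// still wants to run. Each slice ends at the task's own yield point.
struct CoopTask {
    std::string name;
    int remaining;
};

// Drains the run queue in FIFO order and returns the order in which
// slices were executed.
std::vector<std::string> run_cooperative(std::deque<CoopTask> run_queue) {
    std::vector<std::string> trace;
    while (!run_queue.empty()) {
        CoopTask t = std::move(run_queue.front());
        run_queue.pop_front();

        assert(t.remaining > 0);     // only runnable tasks belong in the queue
        trace.push_back(t.name);     // the task "runs" one slice here
        --t.remaining;

        // Yield: a task with work left re-enqueues itself at the back.
        // A finished task is simply dropped.
        if (t.remaining > 0) run_queue.push_back(std::move(t));
    }
    return trace;
}
```

For example, `run_cooperative({{"A", 2}, {"B", 1}, {"C", 3}})` yields the slice order A, B, C, A, C, C. A preemptive scheduler would additionally need a timer and per-task quantum bookkeeping to force those switches.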