Hacker News
by javert 4863 days ago
There are two fundamental ways of doing this: pipelining and worker-threads. In the pipeline model, each thread does a different task, then hands off the task to the next thread in the pipeline.

Why not just implement the entire pipeline in one thread, and then replicate that thread (just like worker threads)?

What will happen is that the first worker thread will be executing stage 2, while the second worker thread is executing stage 1. The OS will automatically schedule them on different cores.
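To make the question concrete, here is a minimal sketch of that "pipeline inside each worker" arrangement. The stage functions and the name `run_to_completion` are made up for illustration. Note that in standard CPython the global interpreter lock keeps these threads from running Python code truly in parallel, so this shows the structure rather than any speedup.

```python
import queue
import threading

def stage1(x):
    # Hypothetical first stage (e.g. parsing).
    return x + 1

def stage2(x):
    # Hypothetical second stage (e.g. processing the parsed value).
    return x * 2

def run_to_completion(items, num_workers=4):
    """Every worker runs ALL stages for each item it picks up."""
    work = queue.Queue()
    results = queue.Queue()
    for item in items:
        work.put(item)

    def worker():
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return  # no work left, this worker exits
            # The whole pipeline runs inside this one thread.
            results.put((item, stage2(stage1(item))))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # All workers have finished, so draining the queue here is safe.
    out = {}
    while not results.empty():
        key, value = results.get()
        out[key] = value
    return out
```

At any instant one worker may be in `stage1` while another is in `stage2`, which is the behavior described above. The shared `work` queue is the one place where all the workers still meet.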

Am I missing something?

2 comments

When there is high contention for a resource, it's better that one thread do it and access it contention-free, rather than make multiple threads contend for it.

Even so-called "lock-free" synchronization has locks; they are just very short (30 clock cycles). Therefore, you still want to avoid contention if you can figure out a way to do it.

I didn't really go into enough detail in my example, but pulling packets off the network is a good illustration. You can have one thread do it, so there is no contention. Then you can set up multiple single-producer/single-consumer ring buffers to forward those packets to worker threads, which complete the processing of each packet. Thus, you essentially get rid of all the atomic/lock-free contention you would otherwise have.
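A rough sketch of that layout, with assumptions stated up front: `dispatch`, the round-robin policy, and the `upper()` "processing" are all hypothetical, and `queue.Queue` (which takes a lock internally) only stands in for a real lock-free single-producer/single-consumer ring buffer. The point is the topology: one receiver, and one queue per worker, so no queue ever has more than one producer or one consumer.

```python
import queue
import threading

SENTINEL = None  # end-of-stream marker; real packets must not be None

def dispatch(packets, num_workers=3):
    # One bounded queue per worker. Each has exactly one producer
    # (the receiver) and exactly one consumer (its worker).
    rings = [queue.Queue(maxsize=64) for _ in range(num_workers)]
    # Each result list is written by a single worker only.
    processed = [[] for _ in range(num_workers)]

    def receiver():
        # The only thread that touches the packet source.
        for i, pkt in enumerate(packets):
            rings[i % num_workers].put(pkt)  # round-robin hand-off
        for ring in rings:
            ring.put(SENTINEL)  # tell every worker to stop

    def worker(idx):
        while True:
            pkt = rings[idx].get()
            if pkt is SENTINEL:
                return
            processed[idx].append(pkt.upper())  # hypothetical processing

    threads = [threading.Thread(target=receiver)]
    threads += [threading.Thread(target=worker, args=(i,))
                for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed
```

For example, `dispatch(["a", "b", "c", "d", "e"], num_workers=2)` sends packets 0, 2, 4 to the first worker and 1, 3 to the second, and each worker sees its packets in arrival order.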

*When there is high contention for a resource, it's better that one thread do it and access it contention-free, rather than make multiple threads contend for it.*

Right, and that's what I'm suggesting. So, if you have a pipelined architecture, keep the pipeline inside worker threads, instead of across them, except when you need to distribute work to the workers (i.e., the first stage, where you do something like take packets from the network). I think we agree on all that. I was just curious if there was ever a reason to do it the other way, i.e., having a separate thread for each stage of the entire pipeline. It seemed like you were suggesting that was useful in some cases, but perhaps I'm reading into things too much.

If a certain stage has global state, a pipeline architecture may result in lower contention, because that state is only accessed by one core and thus doesn't have to be locked. Lock-free producer-consumer communication between stages can be efficient. (AFAIK LineRate just took pipelining to the bank.) If your app has no global state, a run-to-completion (aka worker) architecture is more efficient.

I feel like this is "common knowledge," so maybe I'm just missing something, but I'm not convinced yet.

Basically, what we're saying is that, first, there is a stage that cannot be accessed concurrently, and second, multiple threads may be waiting for that stage to complete before they progress.

So the downstream, waiting threads are either going to have to wait to acquire the lock, or they're going to have to wait for the producer to produce the data.

Either way, they're waiting.

If the critical section is really short and highly contended, you may not want threads waiting on the lock to suspend; you want them to spin, or to keep retrying (which is what happens in lock-free algorithms). OK, fine. Why isn't that solution just as good as having an extra thread to be the producer, and then waiting on the producer to produce more data?