by dietr1ch
75 days ago
> Have you ever run GNU Parallel on a powerful machine just to find one core pegged at 100% while the rest sit mostly idle?

Not exactly, but maybe I haven't used large enough NUMA machines to run tiny jobs? I think parallel usually saturates my CPU, and I'd guess most CPU schedulers are NUMA-aware at this point.

If you care about short tasks, maybe parallel is the wrong tool, but if picking the task to run is the slow part AND you prefer throughput over latency, maybe you need batching instead of a faster job scheduling tool. I'm pretty sure parallel has some flags to allow batching up to K elements, so maybe your process can take several inputs at once. Alternatively, you can bundle inputs as you generate them, but that might require a larger change to both the process that runs tasks and the one that generates the inputs for them.
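To illustrate the batching idea, here is a minimal sketch using `xargs -n` as a stand-in, since it is available almost everywhere; GNU Parallel's `-N` (`--max-args`) and `-m` options play a similar role. The input (`seq 1 10`), the batch size of 4, and the `batch_demo` name are arbitrary illustration choices.

```shell
# Hand each invocation up to 4 inputs instead of 1, so the process
# startup cost is paid once per batch rather than once per input.
batch_demo() {
    seq 1 10 | xargs -n 4 echo batch:
}

batch_demo
```

Ten inputs end up in three invocations here instead of ten, which is the throughput-over-latency trade described above.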
|
Let me give you an example of a "worst-case" scenario for parallel. Start by making a file on a tmpfs with 10 million newlines.
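One way such a file could be built, as a sketch only: it assumes `/dev/shm` is a tmpfs mount (typical on Linux, with a fallback to `/tmp` if it is not available), and the file name is hypothetical.

```shell
# Assumption: /dev/shm is tmpfs; fall back to /tmp if it is missing or read-only.
dir=/dev/shm
[ -d "$dir" ] && [ -w "$dir" ] || dir=/tmp
f="$dir/newlines_10M"    # hypothetical file name

# Read 10,000,000 NUL bytes from /dev/zero and turn each one into a newline.
head -c 10000000 /dev/zero | tr '\0' '\n' > "$f"

# Confirm the line count.
wc -l < "$f"
```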
So, now let's see how long it takes parallel to push all these lines through a no-op. This measures the pure "overhead of distributing 10 million lines in batches". I'll set it to use all my CPU cores (`-j $(nproc)`) and to use multiple lines per batch (`-m`).

Average CPU utilization here (on my 14c/28t i9-7940x) is CPU time / real time.

Note that there is 1 process pegged at 100% usage the entire time that isn't doing any "work" in terms of processing lines; it's just distributing lines to workers. If we assume that thread averaged about 0.98 cores utilized, it means that throughout the run parallel managed to keep around 0.066 out of 28 CPUs saturated with actual work.

Now let's try with frun.
CPU utilization is:

Let's compare the wall clock times:

Interestingly, if we look at the ratio of CPU utilization (spent on real work):

which gives a pretty straightforward story: forkrun is 300x faster here because it is utilizing 300x more CPU for actually doing work.

This regime of "high frequency low latency tasks" (millions or billions of tasks that take milliseconds or microseconds each) is the regime where forkrun excels and tools like parallel fall apart.
Side note: if I bump it to 100 million newlines:
CPU utilization: which on a 14c/28t CPU doing no-ops... isn't bad.
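The utilization figures in this comment follow from simple arithmetic: total CPU time (user + sys) divided by wall-clock time, minus the share consumed by the distributor process. A small sketch of that calculation is below. The function names are hypothetical, and the timing inputs are made-up placeholders chosen only so that the result reproduces the 0.066 figure quoted above; they are not measurements from the run.

```shell
# Average CPU utilization = (user + sys CPU seconds) / wall-clock seconds.
cpu_util() {
    awk -v u="$1" -v s="$2" -v r="$3" 'BEGIN { printf "%.3f\n", (u + s) / r }'
}

# Cores left for real work once the distributor's share is subtracted.
work_share() {
    awk -v total="$1" -v dist="$2" 'BEGIN { printf "%.3f\n", total - dist }'
}

# Placeholder timings (hypothetical): 8.00s user, 2.46s sys, 10.00s real.
total=$(cpu_util 8.00 2.46 10.00)

# 0.98 is the distributor share assumed in the comment.
work_share "$total" 0.98
```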