parallel works fine so long as the time per job is on the order of seconds or longer.

Let me give you an example of a "worst-case" scenario for parallel. Start by making a file on a tmpfs with 10 million newlines:

    yes $'\n' | head -n 10000000 > /tmp/f1
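A quick way to sanity-check that command is to run it at a smaller scale. This is only a sketch: the 1000-line count and the `mktemp` path are illustrative and not part of the original benchmark. `yes $'\n'` emits two newlines per repetition, but `head -n` cuts the stream at an exact line count, so the file ends up with N empty lines of one byte each.

```shell
# Scaled-down check (assumed size: 1000 lines; the real run used 10 million).
f=$(mktemp)
yes $'\n' | head -n 1000 > "$f"

# $(( ... )) strips any padding that some wc implementations add.
lines=$(( $(wc -l < "$f") ))
bytes=$(( $(wc -c < "$f") ))
echo "lines=$lines bytes=$bytes"

rm -f "$f"
```

Both counts should match the number passed to `head -n`.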

So, now let's see how long it takes parallel to push all these lines through a no-op. This measures the pure "overhead of distributing 10 million lines in batches". I'll set it to use all my CPU cores (`-j $(nproc)`) and to use multiple lines per batch (`-m`).

    time { parallel -j $(nproc) -m : </tmp/f1; }

    real    2m51.062s
    user    2m52.191s
    sys     0m6.800s

Average CPU utilization here (on my 14c/28t i9-7940x) is CPU time / real time:

    (172.191 + 6.8) / 171.062 = 1.0463516152 CPUs utilized

Note that there is 1 process pegged at 100% usage the entire time that isn't doing any "work" in terms of processing lines - it's just distributing lines to workers. If we assume that process averaged about 0.98 cores utilized, it means that throughout the run parallel managed to keep only around 0.066 (1.046 - 0.98) out of 28 CPUs saturated with actual work.
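The same arithmetic can be scripted. A minimal sketch: `cpu_util` is a hypothetical helper name (not part of parallel or frun), the inputs are the `time` figures above converted to seconds, and the 0.98 distributor share is the assumption stated above, not a measurement.

```shell
# cpu_util USER SYS REAL -> average number of CPUs kept busy
cpu_util() {
    awk -v u="$1" -v s="$2" -v r="$3" 'BEGIN { printf "%.4f\n", (u + s) / r }'
}

# parallel run: user 2m52.191s, sys 0m6.800s, real 2m51.062s
total=$(cpu_util 172.191 6.8 171.062)

# Subtract the assumed 0.98 cores spent by the distributing process.
useful=$(awk -v t="$total" -v d=0.98 'BEGIN { printf "%.3f\n", t - d }')

echo "total CPUs utilized:  $total"
echo "CPUs doing real work: $useful"
```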

Now let's try with frun

    . ./frun.bash
    time { frun : </tmp/f1; }

    real    0m0.559s
    user    0m10.409s
    sys     0m0.201s

CPU utilization is

    ( 10.409 + .201 ) / .559 = 18.9803220036 CPUs utilized

Let's compare the wall clock times

    171.062 / 0.559 = 306x speedup

Interestingly, if we look at the ratio of CPU utilization (spent on real work):

    18.9803220036 / 0.066 = 287x more CPU usage doing actual work

which gives a pretty straightforward story - forkrun is roughly 300x faster here because it is utilizing roughly 300x more CPU for actually doing work.
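For anyone who wants to plug in their own timings, the two ratios can be computed the same way. A sketch only: `ratio` is a hypothetical helper name, and the inputs are the figures quoted above.

```shell
# ratio A B -> A / B, printed to one decimal place
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

speedup=$(ratio 171.062 0.559)             # wall clock: parallel / frun
work_ratio=$(ratio 18.9803220036 0.066)    # CPUs doing real work: frun / parallel

echo "wall-clock speedup: ${speedup}x"
echo "useful-CPU ratio:   ${work_ratio}x"
```

The two values land within about 6% of each other, which is what the "300x faster because 300x more CPU doing work" reading relies on.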

This regime of "high frequency low latency tasks" - millions or billions of tasks that take milliseconds or microseconds each - is the regime where forkrun excels and tools like parallel fall apart.

Side note: if I bump it to 100 million newlines:

    time { frun : </tmp/f1; }

    real    0m4.212s
    user    1m52.397s
    sys     0m1.019s

CPU utilization:

    ( 112.397 + 1.019 ) / 4.212 = 26.9268 CPUs utilized

which on a 14c/28t CPU doing no-ops...isn't bad.