
> I'm curious about the performance is of forkrun "echo ." in a billion jobs vs. say pure C

Short answer: in its fastest mode, forkrun gets very close to the practical dispatch limit for this kind of workload. A tight C loop would still be faster, but at that point you're no longer comparing “parallel job dispatch”; you're comparing raw in-process execution.

Let me at least show what kind of performance forkrun gives here. Let's put 1 billion newlines in a file on a tmpfs:

    cd /tmp
    yes $'\n' | head -n 1000000000 > f1

Now let's try `frun echo`:

    time { frun echo <f1 >/dev/null; }

    real    0m43.779s
    user    20m3.801s
    sys     0m11.017s

forkrun in its "standard mode" hits about 25 million lines per second running newlines through a no-op (`:`), and slightly less (about 23 million lines per second, as in the run above) running them through echo. The vast majority of this time is bash overhead. forkrun breaks the lines up into batches of up to 4096 (for 1 billion lines the average batch size is probably around 4095). Then, for each batch, a worker-specific data-reading fd is advanced to the byte offset where that batch's data starts, and the worker runs:

    mapfile -t -n $N -u $fd A    # N is typically 4096 here
    echo "${A[@]}"

The second command (specifically, the array expansion into a long list of quoted empty args) is what takes up the vast majority of the time. frun has a flag (`-U`) that causes it to replace `"${A[@]}"` with `${A[*]}`, which (when all inputs are empty) collapses the long list of quoted empty args into a long run of spaces, i.e. 0 args. This speeds things up considerably when the inputs are all empty.
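To make the difference between the two expansions concrete, here is a small standalone bash sketch. It is not forkrun code: the temp file, the batch size `N=4`, and the `count_args` helper are made up for illustration. It reads a batch of empty lines with `mapfile` from a dedicated fd, then counts how many arguments each expansion would hand to the command.

```shell
#!/usr/bin/env bash
# count_args is a hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# Stand-in for the input: a small temp file holding 5 empty lines.
tmp=$(mktemp)
printf '\n\n\n\n\n' > "$tmp"

# Read up to N lines from a dedicated fd into array A (N is tiny here).
N=4
exec {fd}<"$tmp"
mapfile -t -n "$N" -u "$fd" A
exec {fd}<&-
rm -f "$tmp"

batch_len=${#A[@]}                # number of lines read into this batch
quoted=$(count_args "${A[@]}")    # one (empty) argument per line
unquoted=$(count_args ${A[*]})    # empty words vanish during word splitting

echo "batch=$batch_len quoted=$quoted unquoted=$unquoted"
```

With all-empty input, the quoted form passes one empty argument per line while the unquoted form passes none, so at N=4096 the command sees 4096 arguments versus 0. Note that the unquoted form also word-splits and glob-expands non-empty lines, so it is only equivalent when that is acceptable for your input.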

    time { frun -U echo <f1 >/dev/null; }

    real    0m13.295s
    user    6m0.567s
    sys     0m7.267s

And now we are at 75 million lines per second. But we are still largely limited by passing data through bash, which is why forkrun also has a mode (`-s`) that bypasses bash's mapfile + array expansion altogether and instead splices (via one of the forkrun loadable builtins) data directly to the stdin of whatever you are parallelizing. If you are parallelizing a bash builtin (where there is no execve cost), forkrun gets REALLY fast.

    time { frun -s : < f1; }

    real    0m0.985s
    user    0m13.894s
    sys     0m12.398s

That means it is delimiter scanning, dynamically batching, and distributing (in batches of up to 4096 lines) at a rate of OVER 1 BILLION LINES A SECOND, or ~250,000 batches per second.
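For anyone who wants to check the arithmetic, here is a quick sketch that derives the rates from the three `real` timings above. The `rate` helper is just something made up for this comment, and the batch count assumes every batch is a full 4096 lines, so it is a lower bound.

```shell
#!/usr/bin/env bash
lines=1000000000   # lines in f1
batch=4096         # maximum batch size

# rate COUNT SECONDS: print COUNT / SECONDS rounded to the nearest integer.
rate() { awk -v n="$1" -v t="$2" 'BEGIN { printf "%.0f\n", n / t }'; }

std_rate=$(rate "$lines" 43.779)   # frun echo
u_rate=$(rate "$lines" 13.295)     # frun -U echo
s_rate=$(rate "$lines" 0.985)      # frun -s :

# Assumption: all batches are full, so this is the minimum number of batches.
batches=$(( (lines + batch - 1) / batch ))
s_batches=$(rate "$batches" 0.985)

printf '%-14s %12s lines/s\n' 'frun echo'    "$std_rate"
printf '%-14s %12s lines/s\n' 'frun -U echo' "$u_rate"
printf '%-14s %12s lines/s\n' 'frun -s :'    "$s_rate"
printf '%-14s %12s batches/s\n' 'frun -s :'  "$s_batches"
```

This lands at roughly 23 million, 75 million, and just over 1 billion lines per second, with a little under 250,000 batches per second in the `-s` case.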

At that point the bottleneck is basically just delimiter scanning and kernel-level data movement. There's very little “scheduler overhead” left to remove, whether you write it as a bash+C hybrid (like forkrun) or in pure C.