AFAIK, the Go runtime is pretty NUMA-oblivious. The mcache helps a bit with locality of small allocations, but otherwise, you aren't going to get the same benefits (though I absolutely hear you about avoiding execve overhead).
So...yes, the execve overhead is real. BUT there's still a lot you can accomplish with pure bash builtins (which don't have the execve overhead). And, if you're open to rewriting things (which would probably be required to some extent if you were to make something intended for shell to run in Go) you can port whatever you need to run into a bash builtin and bypass the execve overhead that way. In fact, doing that is EXACTLY what forkrun does, and is a big part of why it is so fast.
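To make the builtin-vs-execve gap concrete, here is a rough sketch that times the same loop with the `:` builtin and with an external `true` binary. The loop count `N` and the helper name `time_loop` are arbitrary choices for illustration and are not taken from forkrun; the absolute numbers will vary by machine.

```shell
#!/usr/bin/env bash
# Rough comparison of a builtin no-op vs. an external no-op.
# N and time_loop are illustrative names, not part of forkrun.
N=2000

# `type -t` shows how bash resolves a name; ':' is a shell builtin.
type -t :

# `type -P` forces a PATH lookup, giving the external binary for `true`.
EXTERNAL_TRUE=$(type -P true)

time_loop() {
    # Run the given command N times in the current shell.
    local i
    for ((i = 0; i < N; i++)); do
        "$@"
    done
}

echo "builtin ':' x $N (no fork/execve per call)"
time time_loop :

echo "external '$EXTERNAL_TRUE' x $N (one fork + execve per call)"
time time_loop "$EXTERNAL_TRUE"
```

The second loop pays for a process spawn on every iteration, which is the cost that moving hot-path work into a builtin avoids.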
0. vfork (which is sometimes better than CoW fork) + execve if exec is the only outcome of the spawned child. Or, use posix_spawn where available.
1. Inner-loop hot path code {sh,c}ould be made a bash built-in after proving that it's the source of a real performance bottleneck. (Just say "no" to premature optimization.) Otherwise, rewrite the whole thing in something performant enough like C, C++, Rust, etc.
2. I'm curious what the performance of forkrun "echo ." is across a billion jobs vs., say, pure C doing it with 1 worker thread per core.
> I'm curious what the performance of forkrun "echo ." is across a billion jobs vs., say, pure C
Short answer: in its fastest mode, forkrun gets very close to the practical dispatch limit for this kind of workload. A tight C loop would still be faster, but at that point you're no longer comparing “parallel job dispatch”—you're comparing raw in-process execution.
Let me try to at least show what kind of performance forkrun gives here. Let's set up 1 billion newlines in a file on a tmpfs:
cd /tmp
yes $'\n' | head -n 1000000000 > f1
Now let's try frun echo:
time { frun echo <f1 >/dev/null; }
real 0m43.779s
user 20m3.801s
sys 0m11.017s
forkrun in its "standard mode" hits about 25 million lines per second running newlines through a no-op (:), and ever so slightly less (23 million lines a second) running them through echo. The vast majority of this time is bash overhead. forkrun breaks up the lines into batches of (up to) 4096 (but for 1 billion lines the average batch size is probably 4095). Then for each batch, a worker-specific data-reading fd is advanced to the correct byte offset where the data starts, and the worker runs
mapfile -t -n $N -u $fd A # N is typically 4096 here
echo "${A[@]}"
The second command (specifically the array expansion into a long list of quoted empty args) is what takes up the vast majority of the time. frun has a flag (-U) that causes it to replace `"${A[@]}"` with `${A[*]}`, which (in the case of all empty inputs) collapses the long list of quoted empty args into a long run of spaces -> 0 args. This considerably speeds things up when inputs are all empty.
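The difference between the two expansions can be checked in plain bash. This is a minimal sketch, not forkrun's code: `count_args` is a hypothetical helper, and the batch size is 5 rather than 4096 to keep it readable.

```shell
#!/usr/bin/env bash
# count_args is a hypothetical helper: it prints how many arguments it got.
count_args() { echo "$#"; }

# Read a small batch of empty lines, the way a worker would with mapfile
# (5 lines here instead of the 4096 mentioned above).
mapfile -t -n 5 A < <(printf '\n\n\n\n\n\n\n')
echo "elements read: ${#A[@]}"

# Quoted [@]: every element becomes its own argument, even when empty.
count_args "${A[@]}"

# Unquoted [*]: the result is word-split, and empty words are dropped.
count_args ${A[*]}

# Caveat: with non-empty data the unquoted form changes argument boundaries,
# because it is subject to word splitting (and pathname expansion).
B=("a b c" "")
count_args "${B[@]}"
count_args ${B[*]}
```

For the all-empty batch the quoted form passes 5 arguments and the unquoted form passes none. The last two calls show why the unquoted form is only a safe swap when the inputs are known to be empty or free of whitespace and glob characters.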
time { frun -U echo <f1 >/dev/null; }
real 0m13.295s
user 6m0.567s
sys 0m7.267s
And now we are at 75 million lines per second. But we are still largely limited by passing data through bash, which is why forkrun also has a mode (`-s`) where it bypasses bash mapfile + array expansion altogether and instead splices (via one of the forkrun loadable builtins) data directly to the stdin of whatever you are parallelizing. If you are parallelizing a bash builtin (where there is no execve cost), forkrun gets REALLY fast.
time { frun -s : < f1; }
real 0m0.985s
user 0m13.894s
sys 0m12.398s
which means it is delimiter scanning, dynamically batching, and distributing (in batches of up to 4096 lines) at a rate of OVER 1 BILLION LINES A SECOND, or at a rate of ~250,000 batches per second.
At that point the bottleneck is basically just delimiter scanning and kernel-level data movement. There’s very little “scheduler overhead” left to remove—whether you write it in bash+C hybrids (like forkrun) or pure C.
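The rates quoted above can be re-derived from the `real` times with integer shell arithmetic. This is just a back-of-the-envelope sketch; `rate_per_sec` is a hypothetical helper, and the batch count assumes every batch except possibly the last is a full 4096 lines.

```shell
#!/usr/bin/env bash
# rate_per_sec is a hypothetical helper: items per second, rounded down,
# given an item count and a wall-clock time in milliseconds.
rate_per_sec() { echo $(( $1 * 1000 / $2 )); }

LINES_TOTAL=1000000000
BATCH_MAX=4096

# Ceiling division: number of batches if all but the last one are full.
BATCHES=$(( (LINES_TOTAL + BATCH_MAX - 1) / BATCH_MAX ))

echo "frun echo    (43.779 s): $(rate_per_sec "$LINES_TOTAL" 43779) lines/s"
echo "frun -U echo (13.295 s): $(rate_per_sec "$LINES_TOTAL" 13295) lines/s"
echo "frun -s :    ( 0.985 s): $(rate_per_sec "$LINES_TOTAL" 985) lines/s"
echo "frun -s :    ( 0.985 s): $(rate_per_sec "$BATCHES" 985) batches/s"
```

These come out to roughly 23 million, 75 million, and just over 1 billion lines per second, and a little under 250,000 batches per second for the `-s` run.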