Hacker News new | ask | show | jobs
by wood_spirit 81 days ago
Thanks for making and thanks for sharing :)

I’m not a parallels kind of user but I can appreciate your craft and know how rewarding these odysseys can be :)

What was the biggest “aha” moment when you worked how things interlock or you needed to make both change an and b at the same time, as either on their own slowed it down? Etc. And what is the single biggest impacting design choice?

And if you’re objective, what could be done to other tools to make them competitive?

2 comments

So, in forkruns development there have been a few "AHA!" moments. Most of them were accompanied by a full re-write (current forkrun is v3).

The 1st AHA, and the basis for the original forkrun, was that you could eliminate a HUGE amount of the overhead of parallelizing things in shell in you use persistent workers and have them run things for you in a loop and distribute data to them. This is why the project is called "forkrun" - its short for "first you FORK, then you RUN".

The 2nd AHA, which spawned forkrun v2, was that you could distribute work without a central coordinator thread (which inevitably becomes the bottleneck). forkrun v2 did this by having 1 process dump data into a tmpfile on a ramdisk, then all the workers read from this file using a shared file descriptor and a lightweight pipe-based lock: write a newline into a shared anonymous pipe, read from pipe to acquire lock, write newline back to pipe to release it. FIFO naturally queues up waiters. This version actually worked really well, but it was a "serial read, parallel execute" design. Furthermore, the time it took to acquire and release a lock meant the design topped out at ~7 million lines per second. Nothing would make it faster, since that was the locking overhead.

The 3rd AHA was that I could make a very fast (SIMD-accellerated) delimiter scanner, post the byte offsets where lines (or batches of lines) started in the global data file, and then workers could claim batches and read data in parallel, making the design fully "parallel read + parallel execute"

The 4th AHA was regarding NUMA. it was "instead of reactively re-shuffling data between nodes, just put it on the right node to begin with". Furthermore, determine the "right node" using real-time backpressure from the nodes with a 3 chunk buffer to ensure the nodes are always fed with data. This one didn't need a rewrite, but is why forkrun scales SO WELL with NUMA.

> And if you’re objective, what could be done to other tools to make them competitive?

I wanted to reply separately to this bit, because I needed a bit of time to think about and respond to it.

To be frank, parallel optimizes for "breadth of features" and has, for example, the ability to coordinate distributed computing over ssh. But it fundamentally assumes that the workload itself will take dramatically longer than the coordination.

To really be competitive in "high-frequency low-latency workloads", where you have millions of inputs and each only takes microseconds, you would need a complete rewrite with an entirely different way of thinking.

Let me drop a few numbers just to drive this point home. Parallel is capable of batching and distributing around 500 batches of work a second. forkrun, in its "pass arguments via quoted cmdline args" mode is capable of batching and distributing around 10,000 batches a second. This is mostly limited by how fast bash can assemble long strings of quoted arguments to pass via the command line. In forkrun's `-s` mode, which bypasses bash entirely and splices data directly to the stdin of what you are parallelizing, forkrun is capable of batching and distributing over 200,000 batches a second.

The biggest architectural hurdle most existing tools have that makes it impossible to achieve forkrun's batch distribution rate is that almost all use a central distributor thread that forks each individual call (which is very expensive) and that is ALWAYS the bottleneck in high-frequency low-latency workloads. Pushing past this requires moving to a persistent worker model without a central coordinator. This alone necessitates a complete rewrite for basically all the existing tools.

That said, forkrun takes it so much further:

* It uses a SIMD-accelerated delimiter scanner + lock-free async-io to allow for workers to not only execute in parallel but to read inputs to run in parallel.

* It doesn't just use a standard "lock-free" design with CAS retry loops everywhere - it treats the problem like a physical pipeline of data flow and structurally eliminates contention between workers. The literal only "contention" is a single atomic on a single cache line - namely when a worker claims a batch by running `atomic_fetch_add` on a global monotonically increasing index (`read_idx`).

* It doesn't use heuristics - it uses a proper closed-loop control system. There is a 3-stage ramp-up (saturate workers -> geometric ramp -> backpressure-guided PID) to dynamically determine the batch size and the number of workers that works extremely well for all input types with 0 manual tuning.

* It keeps complexity in the slow path. Claiming a batch of lines literally just involves reading a couple shared mmap'ed vars and an `atomic_fetch_add` op in the fast path, which is why it can break 1 billion lines a second. The complexity is all so the slow path degrades gracefully, which is where it smartly trades latency for throughput (but only when throughput is limited by stdin to begin with).

* It treats NUMA as 1st class and chooses the "obvious in hindsight" path to just put data on the correct NUMA node from the very start instead of re-shuffling it between nodes reactively later.

I could go on, but the TL;DR is: to be competitive, other tools would really need to try and solve the "high-frequency low-latency stream parallelization" problem from first principles like forkrun did.

Great read, thanks :)

This is the kind of buzz I search out in my own programming :)

Have fun and keep challenged :)