by zackmorris 79 days ago
This is great!

Forkrun is part of a vanishingly small number of projects written since the 1990s that get real work done as far as multicore computing goes.

I'm not super-familiar with NUMA, but hopefully its concepts might be applicable to other architectures. I noticed that you mentioned things like atomic add in the readme, so that gives me confidence that you really understand this stuff at a deep level.

My use case might eventually be to write a self-parallelizing programming language where higher-order methods run as isolated processes. Everything would be const by default to make imperative code available in a functional runtime. Then the compiler could turn loops and conditionals into higher-order methods since there are no side effects. Any mutability could be provided by monads enforcing the imperative shell, functional core pattern so that we could track state changes and enumerate all exceptional cases.

Basically we could write JavaScript/C-style code having MATLAB-style matrix operators that runs thousands of times faster than current languages, without the friction/limitations of shaders or the cognitive overhead of OpenCL/CUDA.

-

I feel that pretty much all modern computer architectures are designed incorrectly, which I've ranted about countless times on HN. The issue is that real workloads mostly wait for memory, since the CPU can run hundreds of times faster than load/store, especially for cache and branch prediction misses. So fabs invested billions of dollars into cache and branch prediction (that was the incorrect part).

They should have invested in multicore with local memories acting together as a content-addressable memory. Then fork with copy-on-write would have provided parallelism for free.
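As a rough illustration of what "fork with copy-on-write" already offers on today's Unix systems, here is a minimal Python sketch. The name `fork_map` and its parameters are made up for this example, and it is only a toy: each child inherits the input through the kernel's copy-on-write mappings rather than having it serialized and sent over, and only the results travel back through a pipe.

```python
import os
import pickle


def fork_map(func, items, workers=4):
    """Apply func to every item using forked child processes (Unix only).

    Hypothetical helper for illustration; not taken from any library.
    """
    items = list(items)
    workers = max(1, min(workers, len(items) or 1))
    children = []

    for w in range(workers):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: `items` and `func` are inherited from the parent via
            # copy-on-write pages, so the input is never pickled or sent.
            os.close(read_fd)
            status = 1
            try:
                part = [func(x) for x in items[w::workers]]
                with os.fdopen(write_fd, "wb") as out:
                    pickle.dump(part, out)
                status = 0
            finally:
                os._exit(status)  # never fall back into the parent's code
        # Parent: keep only the read end of this child's pipe.
        os.close(write_fd)
        children.append((pid, read_fd))

    results = [None] * len(items)
    for w, (pid, read_fd) in enumerate(children):
        with os.fdopen(read_fd, "rb") as src:
            part = pickle.load(src)
        os.waitpid(pid, 0)
        # Child w handled items w, w + workers, w + 2 * workers, ...
        results[w::workers] = part
    return results
```

Calling `fork_map(lambda x: x * x, range(10), workers=3)` returns the squares in input order. One caveat: in CPython, reference counting writes to object headers, so some inherited pages do get copied even when the child only reads the data.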

Instead, CPU progress (and arguably Moore's law itself) ended around 2007 with the arrival of the iPhone and Android, which sent R&D money to low-cost and low-power embedded chips. So the world was forced to jump on the GPU bandwagon, doubling down endlessly on SIMD instead of giving us MIMD.

Leaving us with what we have today: a dumpster fire of incompatible paradigms like OpenGL, Direct3D, Vulkan, Metal, TPUs, etc.

When we could have had transputers with unlimited compute and memory, scaling linearly with cost, that could run 3D and AI libraries as abstraction layers. Sadly that's only available in cloud computing currently.

We just got lucky that neural nets can run on GPUs. It would have been better to have access to the dozen or so other machine learning algorithms, especially genetic algorithms (which run poorly on GPUs).

Maybe your work can help bridge that gap.

1 comment

I appreciate the high praise re: forkrun.

forkrun's NUMA approach is largely based on the idea that, as you said, "real workloads mostly wait for memory". The wait gets worse on NUMA because accessing memory on a different chiplet or a different socket means reaching data that is physically farther from the CPU, and thus has higher latency. forkrun takes a somewhat unusual approach to this: instead of taking data in, putting it somewhere, and reshuffling it based on demand, forkrun places it in the correct NUMA node's memory as soon as it comes in. This creates a NUMA-striped global data memfd. On NUMA systems, forkrun duplicates most of its machinery (indexer + scanner + worker pool) per node, and each node's machinery is only offered chunks from the global data memfd that are already in node-local memory.

This directly aims to solve (or at least reduce the effect of) "CPUs waiting for memory" on NUMA systems, where the wait (if memory has to cross sockets) can be substantial.
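To make the placement idea concrete, here is a toy Python model. It is not forkrun's code: the function names are hypothetical, and the round-robin striping rule is an assumption made purely for illustration (the description above only says that data lands on the correct node as it arrives). The model just counts how many chunk reads would cross nodes under node-local dispatch versus a naive split.

```python
from collections import defaultdict


def stripe_chunks(num_chunks, num_nodes):
    """Decide at ingest time which node's memory holds each chunk.

    Assumption: plain round-robin striping, chosen only for this sketch.
    """
    return {chunk: chunk % num_nodes for chunk in range(num_chunks)}


def node_local_queues(placement):
    """Build one work queue per node containing only that node's chunks."""
    queues = defaultdict(list)
    for chunk, node in sorted(placement.items()):
        queues[node].append(chunk)
    return dict(queues)


def count_remote(assignments, placement):
    """Count (worker_node, chunk) pairs where the chunk lives on another node."""
    return sum(1 for node, chunk in assignments if placement[chunk] != node)


placement = stripe_chunks(num_chunks=6, num_nodes=2)
queues = node_local_queues(placement)

# Node-local dispatch: each node's workers are only offered their own chunks.
local = [(node, chunk) for node, chunks in queues.items() for chunk in chunks]

# Naive dispatch: first half of the chunks to node 0, second half to node 1,
# ignoring where the data actually lives.
naive = [(0 if chunk < 3 else 1, chunk) for chunk in range(6)]

print(count_remote(local, placement))  # 0 cross-node reads
print(count_remote(naive, placement))  # 2 cross-node reads
```

In this model the node-local queues never trigger a cross-node read, while the naive split does for a third of the chunks. Real behavior depends on how the kernel actually places pages, which this sketch does not simulate.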