Hacker News new | ask | show | jobs
by jerf 254 days ago
If the author has not already, I would commend to them a search of the literature (or the relevant blog summaries) for the term "implicit parallelism". This was an academic topic from a few years back (my brain does not do this sort of thing very well but I want to say 10-20 years go) where the hope was that we could just fire some sort of optimization technique at normal code which would automatically extract all the implicit parallelism in the code and parallelize it, resulting in massive essentially-free gains.

In order to do this, the first thing that was done was to analyze existing source code and determine what the maximum amount of implicit parallelism was that was in the code, assuming it was free. This attempt then basically failed right here. Intuitively we all expect that our code has tons of implicitly parallelism that can be exploited. It turns out our intuition is wrong, and the maximum amount of parallelism that was extracted was often in the 2x range, which even if the parallelization was free it was only a marginal improvement.

Moreover, it is also often not something terribly amenable to human optimization either.

A game engine might be the best case scenario for this sort of code, but once you start putting in the coordination costs back into the charts those charts start looking a lot less impressive in practice. I have a sort of rule of thumb that the key to high-performance multithreading is that the cost of the payload of a given bit of coordination overhead needs to be substantially greater than the cost the coordination, and a games engine will not necessarily have that characteristic... it may have lots of tasks to be done in parallel, but if they

7 comments

I guess you can argue that instruction reordering, SMT/Hyper-threading are already eating the easy wins there. And as you said, it seems like the gains taper off at 2x.

I'm not sure why games would be a good target. They're traditionally very much tied to a single thread, because ironically, passing data to the graphics and display hardware and to multi threaded subroutines like physics all has to be synchronized.

The easiest way to do that without locking a bunch of threads is to let a single thread go as fast as possible through all that main thread work.

If you really want a game focused parallelization framework, look into the Entity Component System pattern. The developer defines the data and mutability flow of various features in the game.

Because the execution ordering is fully known, the frameworks can chunk, schedule, reorder, and fan-out, etc the work across threads with less waiting or cache misses.

"I'm not sure why games would be a good target..."

"If you really want a game focused parallelization framework, look into the Entity Component System pattern."

Exactly that. You can break a lot modern games nicely into a lot of little things being done to discrete entities very quickly. But there the problem is that it's easy for the things to be too small, meaning you don't have a lot of time to be "clever" in the code.

I'm ignoring the GPU and just looking at CPU for this. GPU is in its own world where parallelization is forced on you comprehensively, in a space forced to be amenable to that.

Sure there's work you can throw into ECS but its a paradigm shift that is not implicit and also highlights how much doesn't work.
Yeah, ECS is the approach I've heard can get games to have sufficient parallelism. Though only if you are careful about sticking to it properly, and with careful management of the dataflow between systems. I think the main thing is, apart from the difficulty of doing the analysis in the first place, if you're writing the code without thinking about parallelism, you will tend to introduce data dependcies all over the place and changing that structure is very difficult.
Agreed. It exemplifies what parallelism is on the table but also how many more guarantees need to be enforced to get it.

You're almost swinging the pendulum back to a fixed pipeline and I don't think you can get that for free.

ECS is not a game specific thing. It’s a redrawing of encapsulation boundaries which can be applied to systems broadly.

https://youtube.com/watch?v=wo84LFzx5nI

Back about 20 years, DO CONCURRENT went into FORTRAN. It says that the programmer claims all the iterations of a DO look are non-interfering. This was never really that useful. It holds for matrix multiply, but not much else. It's exactly what you want for the article's use case of adding things up in parallel.

There are a few standard cases for parallelism:

- There's no interaction between tasks at all for long periods. Easiest case. Use case is codebreaking and crypto mining. Best done on special purpose hardware.

- You're running a service with a large number of incoming connections. The connections are essentially independent of each other. This is easy to do concurrently, because the threads don't talk to each other. For N > 1000 or so, this is the useful case for server-side async.

- You have some big area or volume oriented computation, like weather prediction, where there are many local regions being computed, and the regions talk to their neighbors a bit. This is the classic supercomputer application.

- You have lots of little tasks going on, queuing up events for each other. This was the vision Alan Kay had for Smalltalk. It sometimes shows up inside games, and inside discrete event simulations. The internals of some operating systems work this way.

- Anything that runs on a GPU. Very limited interaction between tasks. Until you get to shadows, lighting, occlusion culling, reflections, and anything where you don't want to do N lights x M meshes processing. Vulkan gives you the low-level tools to deal with the interlocking needed to do that, which is why Vulkan is so complicated.

(I can't speak to LLM training; haven't been inside that.)

The Fortran committee botched DO CONCURRENT badly. The requirements imposed on the program only make it safe to execute its iterations in any serial order, but are not sufficient to allow safe parallel execution. So one can write a perfectly conforming DO CONCURRENT loop that will produce wrong answers when actually run in parallel. The problem in the spec seems to have been inadvertent but they have refused to fix it (and don’t seem to understand the problem either.)
One thing that might help is that state space explosion is mainly caused by mutability. Because each mutation potentially creates branching that can be difficult or impossible to statically analyze.

In other words, imperative programming with const is directly transpilable to/from functional programming.

Using modern techniques like higher order methods, scatter-gather arrays (similar to map-reduce), passing by value via copy-on-write, etc, code can be written that works like piping data between unix executables. Everything becomes a spreadsheet basically.

Which also has implications for multiprocessing. When all data is const, many of the async edge cases that we're normally forced to deal with go away. Loosely that means that we can use techniques similar to double buffering to create new mutated copies where only the part that changed was actually copied. The rest of the data can reference one source of truth. In practice, this turns out to most closely approximate optimal algorithms, where the best-effort imperative code may paint itself into a corner because logic design choices get dictated by the realities of the runtime instead of the problem domain itself, so the optimizer loses opportunities to simplify it.

I find that in my daily work, nearly all of my time is spent untangling highly-imperative mutable code. Because I must hold the entire state in my mind when mutable references (mostly to objects) are passed between functions. My higher-order method solutions often end up shorter, more readable, faster, even more efficient, but harder to explain to junior developers. So I would really like a compiler that implicitly converts imperative code like for() loops and references into const/functional code whose intermediate code (i-code) can be reduced to its simplest form.

This is also my main concern about Rust and other mainstream imperative languages like C# and even Javascript, that they encourage bare-hands methods that go away with const/functional code if the user is willing to accept a doubling or more of memory usage. Which becomes less of an issue over time as memory cost decreases.

Edit: I maybe should have said fork-join instead of scatter-gather, but they are related concepts.

> Using modern techniques like higher order methods, scatter-gather arrays (similar to map-reduce), passing by value via copy-on-write, etc, code can be written that works like piping data between unix executables. Everything becomes a spreadsheet basically.

I have built a decent amount of multithreaded and distributed systems. I agree in principle with everything you wrote in that paragraph. In practice, I find such techniques add a lot of unstable overhead in the processing phase as memory is copied, and again while the results are merged, conflicts resolved, etc. They also lock you into a certain kind of architecture where global state is periodically synchronized. So IMO the performance of these things is highly workload-dependent; for some, it is barely an improvement over serialized, imperative code, and adds a lot of cognitive overhead. For others, it is the obvious and correct choice, and pays huge dividends.

Mostly I find the benefit to be organizational, by providing clear interfaces between system components, which is handy for things developed by teams. But as you said, it requires developers to understand the theory of operation, and it is not junior stuff.

Completely agree that we could use better language & compiler support for such things.

Games have tons of opportunities for parallelism (outside of rendering which is obviously embarrassingly parallel) if they are designed for it from the getgo. Unfortunately, most game engines make decisions up front that force much of their game logic to be inherently serial for no good reason. It is definitely true that the sort of parallelism you get from architecting for it from the beginning bears little resemblance to what a compiler would be able to extract with an automatic semantics-preserving pass.
> would automatically extract all the implicit parallelism in the code and parallelize it, resulting in massive essentially-free gains.

Sounds like what mojo is attempting to do:

https://www.modular.com/mojo

There is some recent work on this too: https://dl.acm.org/doi/10.1145/3632880
What about pipelining? Something that has clear serial dependencies could still be pipelined to execute separate (but dependent at work boundaries) units of work in parallel.