Hacker News new | ask | show | jobs
by sillysaurusx 1478 days ago
More generally, “stop storing state, unless it makes the program less complicated.”

The first version is simple. The second version is more complicated. The simplicity is because the first version can be represented as a formula; imagine trying to write out the second version as a formula, and the complexity becomes obvious.

(The addition loop would have to be written as a recurrence relation, which of course means every step depends on the previous step.)

Complexity isn’t always obvious. It’s sometimes common in C programs to write a loop like dst[index++] = src[i], particularly in deeply nested for-loops. In my experience it’s almost always worth rewriting it so that the indices are computable entirely from the loop iteration variables, with no state. It helps you build a mental map of the operations, because you can think of it in terms of geometry (memcpy = copying rectangles onto other larger rectangles) whereas when the index is stateful it becomes quite a lot harder to visualize in higher dimensions. At least for me.

We’ve been building a compiler at Groq for our custom ML hardware. (Roughly, “turn pytorch into our ISA with no extra programmer effort.”) I used to think of posts like this as “Well, modern compilers are so fast, who really cares about autovectorization?” — it turns out you care when you need to write your own compiler. :)

MLIR is pretty cool. It makes a lot of transformations like this pretty easy. The MLIR “krnl” dialect can also automatically transform nested loops into tiled iteration. Graphics devs will know what I mean — no need to loop over 8x8 blocks of pixels manually, just write “for x in range(width): for y in range(height): …” and set the block size to 8.

2 comments

> stop storing state, unless it makes the program less complicated

I had hoped that functional programming languages would lead us to automatic high-level parallelization. That hasn't happened much in the real world. But what we got is "write code like a functional programmer would even in a low-level language and the compiler and the ISA will parallelize it for you." That surprises me, but I'll take it.

Yeah I mean any functional operation on a list (I'm thinking of JS mainly) should be parallelizable, assuming the interpreter can determine that the order of execution does not matter.

I recall that Scala was the first language that introduced FP to me, and they made the developer explicitly tell that a functional operation could be done in parallel, after which the compiler and runtime would take care of the rest.

This was implemented in the extreme in a project that allowed you to write a functional style series of operations, which were translated to Hadoop map and reduce jobs, each of which could be executed in parallel at large scale. So very compact and succinct code, but very powerful.

It's because under the hood modern processors are embarrassingly parallel, so chip manufacturers like Intel were incentivized to put money into solving the hard problem "turn code written by devs who were trained on old serial-style languages (like C) into something that can pass a benchmark or we can't justify selling new chips because programs won't be faster."

The relative novelty and paucity of functional languages in the ecosystem means the incentive story hasn't happened to get the end-to-end from compiler through to execution so tight yet. I expect it'll happen, and when it does I expect the result will trigger fewer questions like this one because a newer generation of programmers well be less inclined to assume serial execution.

I think it's that it's easier to detect a lack of data dependencies in C than it is to write an efficient Haskell compiler. Well, that and the fact that there is at least an order of magnitude more people working on the former than the latter
memcpy is more like copying a single line, to copy a rectangle you need to call it in a loop or use specialized blit hw/sw.

How is MLIR different from LLVM IR?

And tile/block processing is not that novel for generic compute, graphics use it for other reasons - mostly how the pixel data is in memory and how the pipeline scales with tile size.

LLVM is geared towards providing a single toolchain for frontends to target, but is essentially made to target von Neumann/serial-ish/homogenous machines. You can get an overview of its philosophy in Lattner’s original 2004 paper [1]. MLIR (Lattner 2021 [2]) aims to generalize the compiler toolchain to support different compute paradigms and heterogeneous targets (polyhedral compilation/parallel kernels/afine loops) and so on, by defining a declarative framework of IR _dialects_ and transforms between them. I’m writing my MSc thesis on using C/C++ as the IR, assuming it has enough compiler support for different targets to provide a suitable platform for optimisation transforms using a source-to-source compiler, but MLIR seems like a better first-principles approach, maybe suffering from the impracticality of using it _right now_.

[1] "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation", Lattner and Adve. https://llvm.org/pubs/2004-01-30-CGO-LLVM.pdf

[2] “MLIR: Scaling Compiler Infrastructure for Domain Specific Computation”, Lattner et al. https://ieeexplore.ieee.org/document/9370308