by synergy20 972 days ago
same here, some companies are hiring a compiler team for AI in fact.

I was told Rice and UIUC have the best compiler programs. They're not necessarily AI-related, but the material should be similar.


What does it mean, specifically, "compiler team for AI"? I get it that it's some hot new trend from the last 2 posts, but I'm struggling to imagine what exactly the perfect product should look like and why everyone wants it so bad, allegedly.
It's just about compiling a neural net down to run as efficiently as possible. Either on GPU, CPU, or your own accelerator. Neural nets are very computationally intensive, while being pretty uniform internally and as a class. Everyone making silicon, and a fair few other companies as well, has an AI compiler team right now. At the moment the hot product would just be LLM tokens for as few cents each as possible.
AI problems boil down to compiling linear algebra problems onto very complicated chips.

In standard processing, the code is so branchy that we often resort to heuristics in order to get 'good enough' perf.

The FLOPS difference between a cpu and gpu is huge. It makes things that are intractable on cpus possible. Without gpus there is no deep learning.

That being said, writing code for gpus by relying on cpu compilers will result in terrible perf. In order to take advantage of the hardware you have to take into account minute details of the architecture that most cpu compilers ignore.

Cache-oblivious algorithms are algorithms that know that there is a cache but don't rely on particular cache sizes. It's the way a lot of cpu code is written because it means not having to deal with particulars.
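
To make the cache-oblivious idea concrete, here is a minimal sketch using a recursive matrix transpose, a common textbook example. It keeps halving the larger dimension, so at some recursion depth the working set fits in whatever cache the machine happens to have, without the code ever naming a cache size. The function names and the `base` cutoff are hypothetical; `base` only limits recursion overhead and is not a cache parameter. The input is assumed to be a non-empty rectangular list of lists.

```python
def transpose_recursive(src, dst, r0, r1, c0, c1, base=4):
    """Write the transpose of src[r0:r1][c0:c1] into dst."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        # Small block: transpose it directly.
        for i in range(r0, r1):
            for j in range(c0, c1):
                dst[j][i] = src[i][j]
    elif rows >= cols:
        # Split the row range in half and recurse on each half.
        mid = r0 + rows // 2
        transpose_recursive(src, dst, r0, mid, c0, c1, base)
        transpose_recursive(src, dst, mid, r1, c0, c1, base)
    else:
        # Split the column range in half and recurse on each half.
        mid = c0 + cols // 2
        transpose_recursive(src, dst, r0, r1, c0, mid, base)
        transpose_recursive(src, dst, r0, r1, mid, c1, base)


def transpose(src):
    """Return a new matrix that is the transpose of src."""
    n_rows, n_cols = len(src), len(src[0])
    dst = [[0] * n_rows for _ in range(n_cols)]
    transpose_recursive(src, dst, 0, n_rows, 0, n_cols)
    return dst
```

Nothing in this code changes when the cache size changes, which is the appeal on cpus.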

On gpus, particulars matter. For example, to compile a matrix multiply on an Nvidia GPU, you can't just use vectorized multiplies and adds. No. In order to achieve max performance you need to utilize the warp-level matrix multiply instruction, which requires that you split an arbitrarily sized matrix into the perfect native tensor sizes and then orchestrate the memory loads (which are asynchronous on gpus, and transparently synchronous on cpus) correctly. If you don't, you waste millions of dollars (literally).
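
A rough sketch of just the "split into native tensor sizes" step, in plain Python. It pads an (m x k) times (k x n) multiply up to whole tiles and reports how much of the issued work is useful. The 16x16x16 tile shape is an assumption for illustration (real shapes depend on the GPU generation and data type), `plan_tiles` is a hypothetical helper, and none of the asynchronous memory orchestration is modeled here.

```python
def plan_tiles(m, n, k, tile_m=16, tile_n=16, tile_k=16):
    """Pad an (m x k) @ (k x n) multiply to whole tiles and count tile operations."""
    # Ceiling division: how many tiles are needed along each dimension.
    tiles_m = (m + tile_m - 1) // tile_m
    tiles_n = (n + tile_n - 1) // tile_n
    tiles_k = (k + tile_k - 1) // tile_k

    padded = (tiles_m * tile_m, tiles_n * tile_n, tiles_k * tile_k)

    # Multiply-adds the problem actually needs vs. what the padded tiles issue.
    useful = m * n * k
    issued = padded[0] * padded[1] * padded[2]

    return {
        "padded": padded,
        "tile_ops": tiles_m * tiles_n * tiles_k,
        "utilization": useful / issued,
    }


plan = plan_tiles(100, 100, 100)
print(plan["padded"], plan["tile_ops"], round(plan["utilization"], 3))
```

For a 100x100x100 problem, each dimension pads to 112, so roughly 29% of the issued multiply-adds land on padding. That is the kind of waste a compiler tries to avoid by picking tile shapes that fit the problem.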

So whereas on a cpu you might just modify your matrix multiply loops to get contiguous memory access and add some vectorization in and cross your fingers, on a gpu your compiler needs to take the trivial three nested loop algorithm, look up the cache size particulars and instruction capabilities for the particular generation of the chip, and then rewrite the loop nesting to make it optimal. All while making sure you don't introduce further memory hazards (bank conflicts), etc. So your simple three nested loop algorithm gets turned into a nine nested loop monstrosity.
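
To illustrate how the loop count grows, here is a small Python sketch of the three-loop multiply next to a version with one level of tiling, which already has six loops. Applying further tiling levels in the same way (for example one each for thread block, warp, and instruction tile) is one way the count could reach nine. The `tile` size of 4 is an arbitrary assumption, and this is not what any particular compiler emits.

```python
def matmul_naive(a, b):
    """The trivial three nested loop matrix multiply."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=4):
    """Same result, with one level of tiling: six nested loops."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    # Outer three loops walk over tiles.
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Inner three loops work inside one tile; min() handles
                # ragged edges when sizes are not multiples of the tile.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

Both functions compute the same matrix; only the order in which memory is touched changes, and that order is what the compiler is tuning per chip.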

The stakes are much higher here and the optimizations much different. Whereas on a cpu, we kind of give up with the branching complexity, and just do our best since we never truly know the state of the program, on gpus, the algorithms being executed are extremely amenable to static analysis so we do that, and optimize the shit out of them.

I believe it's either related to the new AI-specialized chips, or maybe to the factoring of neural network graphs.

Any specialized domain tends to have its own domain-specific language, so obviously it would be true for AI, too.

correct, every AI chip maker needs their own compiler team these days