
AI problems boil down to compiling linear algebra problems onto very complicated chips.

In standard processing, the code is so branchy that we often resort to heuristics in order to get 'good enough' perf.

The FLOPS difference between a cpu and gpu is huge. It makes things that are intractable on cpus possible. Without gpus there is no deep learning.

That being said, writing code for gpus by relying on cpu compilers will result in terrible perf. In order to take advantage of the hardware you have to take into account minute details of the architecture that most cpu compilers ignore.

Cache-oblivious algorithms are algorithms that know there is a cache but don't rely on particular cache sizes. A lot of cpu code is written this way because it means not having to deal with particulars.
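To make that concrete, here is a rough sketch of a cache-oblivious matrix multiply in plain Python. It keeps halving the largest dimension, so at some recursion depth the working set fits in whatever cache the machine happens to have, without the code ever naming a cache size. The function name and the `cutoff` parameter are my own for illustration; `cutoff` only limits recursion overhead and is not tuned to any hardware.

```python
def matmul_cache_oblivious(A, B, cutoff=8):
    """Multiply A (n x m) by B (m x p) by recursively halving the largest dimension."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]

    def recurse(i0, i1, j0, j1, k0, k1):
        # Accumulate A[i0:i1, k0:k1] @ B[k0:k1, j0:j1] into C[i0:i1, j0:j1].
        di, dj, dk = i1 - i0, j1 - j0, k1 - k0
        if max(di, dj, dk) <= cutoff:
            # Small enough: plain loops. No cache size appears anywhere.
            for i in range(i0, i1):
                for k in range(k0, k1):
                    a = A[i][k]
                    for j in range(j0, j1):
                        C[i][j] += a * B[k][j]
            return
        if di >= dj and di >= dk:
            mid = i0 + di // 2  # split the rows of A and C
            recurse(i0, mid, j0, j1, k0, k1)
            recurse(mid, i1, j0, j1, k0, k1)
        elif dj >= dk:
            mid = j0 + dj // 2  # split the columns of B and C
            recurse(i0, i1, j0, mid, k0, k1)
            recurse(i0, i1, mid, j1, k0, k1)
        else:
            mid = k0 + dk // 2  # split the shared dimension; both halves add into C
            recurse(i0, i1, j0, j1, k0, mid)
            recurse(i0, i1, j0, j1, mid, k1)

    recurse(0, n, 0, p, 0, m)
    return C
```

The point is that nothing in there would change if you moved it to a chip with a different cache hierarchy.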

On gpus, particulars matter. For example, to compile a matrix multiply on an Nvidia GPU, you can't just use vectorized multiplies and adds. No. In order to achieve max performance you need to utilize the warp-level matrix multiply instruction, which requires that you split an arbitrarily sized matrix into the perfect native tensor sizes and then orchestrate the memory loads (which are asynchronous on gpus, and transparently synchronous on cpus) correctly. If you don't, you waste millions of dollars (literally).
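A minimal sketch of just the "split into native tile sizes" part, in plain Python. It assumes a square native tile with an edge of 16 (that number is an assumption for illustration, the real shapes depend on the chip generation and data type), and the helper names `pad_to_tiles` and `tile_schedule` are made up. It does not model the asynchronous memory loads at all, which is where most of the real difficulty lives.

```python
TILE = 16  # assumed native tile edge; real hardware shapes vary by generation and dtype


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def pad_to_tiles(M, tile=TILE):
    """Zero-pad a matrix so both dimensions are multiples of the tile edge.

    Zero padding leaves the product unchanged in the original region.
    """
    rows, cols = len(M), len(M[0])
    padded_rows = ceil_div(rows, tile) * tile
    padded_cols = ceil_div(cols, tile) * tile
    return [
        [M[r][c] if r < rows and c < cols else 0 for c in range(padded_cols)]
        for r in range(padded_rows)
    ]


def tile_schedule(m, n, k, tile=TILE):
    """List the tile-level multiply-accumulates for an (m x k) @ (k x n) product.

    Each entry (ti, tj, tk) means: output tile (ti, tj) accumulates
    A tile (ti, tk) times B tile (tk, tj).
    """
    return [
        (ti, tj, tk)
        for ti in range(ceil_div(m, tile))
        for tj in range(ceil_div(n, tile))
        for tk in range(ceil_div(k, tile))
    ]
```

For a 100x50 times 50x70 product with a tile edge of 16, that is 7 * 5 * 4 = 140 tile operations, each of which would map onto one matrix instruction in a real kernel.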

So whereas on a cpu you might just modify your matrix multiply loops to get contiguous memory access, add some vectorization, and cross your fingers, on a gpu your compiler needs to take the trivial three-nested-loop algorithm, look up the cache size particulars and instruction capabilities for the particular generation of the chip, and then rewrite the loop nesting to make it optimal. All while making sure you don't introduce further memory hazards (bank conflicts), etc. So your simple three-nested-loop algorithm gets turned into a nine-nested-loop monstrosity.
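Here is roughly what that loop rewriting looks like, sketched in plain Python with a single level of tiling. One level turns three loops into six; tile a second time (say one level for the thread block and one for the warp) and you arrive at nine. The `tile` value of 4 is arbitrary for illustration, whereas a real compiler would derive it from the chip, and this sketch ignores bank conflicts and memory scheduling entirely.

```python
def matmul_naive(A, B):
    """The trivial three-nested-loop matrix multiply."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_tiled(A, B, tile=4):
    """Same arithmetic, restructured into six loops with one level of tiling."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    # Outer three loops walk over tiles.
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                # Inner three loops do the work inside one tile;
                # min() handles ragged edges when sizes are not multiples of tile.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        for k in range(k0, min(k0 + tile, m)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

Both functions produce the same result for integer inputs; only the order in which memory gets touched changes, and that order is exactly what the compiler is fighting over.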

The stakes are much higher here and the optimizations much different. On a cpu, we kind of give up on the branching complexity and just do our best, since we never truly know the state of the program. On gpus, the algorithms being executed are extremely amenable to static analysis, so we do that and optimize the shit out of them.