| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by philipturner 1112 days ago

antinucleon lol

LLaMA is a memory-bound AI model, where the dominant factor in execution time is how fast the processor transfers weights from RAM to registers. LLaMA.cpp uses a misaligned memory pattern that's painful to the RAM I/O interface and requires many CPU instructions to re-align. It also has access to 1/2 the bandwidth of the GPU on Apple's highest-end chips, making GPU *theoretically* 2x faster without algorithm changes.

Funny how you announced it the exact day after I open-sourced a high-bandwidth 4-bit GEMV kernel for LLaMA. If anyone wants to see how to achieve 180+ GB/s on the M1/M2 Max, you can reproduce the methodology here:

https://github.com/ggerganov/llama.cpp/pull/1642#issuecommen...

I later explained the technique to write very optimized kernels for a specific quantization format. *It's very unlikely my code was copied verbatim*, but it was probably used as inspiration and/or a reference. I also disclosed a high-precision means of measuring bandwidth utilization, which is critical for designing such GPU kernels.

1 comments

philipturner 1112 days ago

To give more credit, I know your company was working on your own optimizations, which are prior art. It's possible that you made a 180 GB/s shader on your own (quite slow compared to my 319 GB/s). Or that the 319 GB/s was used, but the self-attention bottleneck was non-negligible.

However, for whatever reason, when Greg Gerganov started work on the Metal backend, you made a product announcement almost a day later. That seems like a non-coincidence and there must be some logical explanation.

link