|
|
|
|
|
by philipturner
1112 days ago
|
|
antinucleon lol LLaMA is a memory-bound AI model, where the dominant factor in execution time is how fast the processor transfers weights from RAM to registers. LLaMA.cpp uses a misaligned memory pattern that's painful to the RAM I/O interface and requires many CPU instructions to re-align. It also has access to 1/2 the bandwidth of the GPU on Apple's highest-end chips, making GPU *theoretically* 2x faster without algorithm changes. Funny how you announced it the exact day after I open-sourced a high-bandwidth 4-bit GEMV kernel for LLaMA. If anyone wants to see how to achieve 180+ GB/s on the M1/M2 Max, you can reproduce the methodology here: https://github.com/ggerganov/llama.cpp/pull/1642#issuecommen... I later explained the technique to write very optimized kernels for a specific quantization format. *It's very unlikely my code was copied verbatim*, but it was probably used as inspiration and/or a reference. I also disclosed a high-precision means of measuring bandwidth utilization, which is critical for designing such GPU kernels. |
|
However, for whatever reason, when Greg Gerganov started work on the Metal backend, you made a product announcement almost a day later. That seems like a non-coincidence and there must be some logical explanation.