| > It's the same thing here. CPUs can run it but only as a gimmick. No, that's not true. I work on local inference code via llama.cpp, on both GPU and CPU on every platform, and the bottleneck is RAM / bandwidth much more than compute. The crappy 2022 mid-range Android CPU in a Pixel Fold gets you roughly the same speed as a 2024 Apple iPhone GPU, with Metal acceleration that dozens of very smart people hack on. Additionally, and perhaps more importantly, Arc is a GPU, not a CPU. The headline of the thing you're commenting on, the very first thing you see when you open it, is "Run llama.cpp Portable Zip on Intel GPU". Additionally, the HN headline includes "1 or 2 Arc 7700". |
The A770 has 16GB of VRAM. Whatever doesn't fit gets shuffled to the GPU at a rate of 64GB/s, which is roughly an order of magnitude slower than the GPU's internal VRAM. Hence, this setup is memory-bandwidth constrained.
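Back-of-the-envelope version of that argument, as a rough sketch: if decoding is bandwidth-bound, tokens/s is capped at link bandwidth divided by the bytes that have to cross the link per token. The 64GB/s is the figure above; the 32GB of weights streamed per token is a made-up number for illustration, not a measurement of any real model.

```python
def max_tokens_per_second(link_gb_s: float, streamed_gb_per_token: float) -> float:
    """Upper bound on decode speed when each token must pull
    `streamed_gb_per_token` of weights across the link."""
    return link_gb_s / streamed_gb_per_token


LINK_GB_S = 64.0               # host-to-GPU rate quoted above
STREAMED_GB_PER_TOKEN = 32.0   # hypothetical: weights that don't fit in 16GB VRAM

# Ceiling only; real throughput will be lower once compute and overhead are added.
print(max_tokens_per_second(LINK_GB_S, STREAMED_GB_PER_TOKEN))
```

Swap in your own numbers; the point is only that the ceiling scales with bandwidth, not with how fast the compute units are.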
However, once you want to do anything useful with it, like running a longer context, CPU compute becomes a huge bottleneck for time-to-first-token as well as tokens/s.
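To put a rough number on the time-to-first-token part: prompt processing for a dense transformer costs about 2 FLOPs per parameter per prompt token (ignoring the attention term, which only makes long contexts worse). The sketch below uses that rule of thumb; the parameter count, prompt length, and sustained FLOP/s are all hypothetical placeholders, not benchmarks.

```python
def prefill_seconds(params: float, prompt_tokens: int, flops_per_s: float) -> float:
    """Approximate time to process the prompt before the first token appears.
    Uses ~2 FLOPs per parameter per token; attention cost is ignored."""
    return 2.0 * params * prompt_tokens / flops_per_s


PARAMS = 70e9            # hypothetical dense model size
PROMPT_TOKENS = 8000     # hypothetical "longer context" prompt
CPU_FLOPS = 1e12         # hypothetical sustained CPU throughput (1 TFLOP/s)

print(prefill_seconds(PARAMS, PROMPT_TOKENS, CPU_FLOPS))
```

Unlike decoding, this part scales with prompt length and compute throughput, which is why it hurts on CPU even when bandwidth is fine.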
Trying to run a model this large, and a thinking one at that, on CPU RAM is a gimmick.