| HN Mirror


by refulgentis 470 days ago

> It's the same thing here. CPUs can run it but only as a gimmick.

No, that's not true.

I work on local inference code via llama.cpp, on both GPU and CPU on every platform, and the bottleneck is much more RAM / memory bandwidth than compute.

A crappy Pixel Fold 2022 mid-range Android CPU gets you roughly the same speed as a 2024 Apple iPhone GPU, with Metal acceleration that dozens of very smart people hack on.

Additionally, and perhaps more importantly, Arc is a GPU, not a CPU.

The headline of the thing you're commenting on, the very first thing you see when you open it, is "Run llama.cpp Portable Zip on Intel GPU"

Additionally, the HN headline includes "1 or 2 Arc A770"

2 comments

aurareturn 470 days ago

It's both compute and bandwidth constrained - just like trying to run Crysis on CPU rendering.

A770 has 16GB of VRAM. You're shuffling data to the GPU at a rate of 64GB/s, which is roughly an order of magnitude slower than the internal VRAM of the GPU. Hence, this setup is memory bandwidth constrained.

However, once you want to use it to do anything useful like a longer context size, the CPU compute will be a huge bottleneck for time-to-first-token as well as tokens/s.

Trying to run a model this large, and a thinking one at that, on CPU RAM is a gimmick.
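The bandwidth argument above can be put into rough numbers. This is only a back-of-envelope sketch: it assumes decode speed is capped by how fast the weights used for one token can be moved, and it plugs in the 64GB/s figure quoted above. The 37B active parameters per token (for a mixture-of-experts model of this size) and the flat 4 bits per weight are illustrative assumptions, not measurements from this setup.

```python
def decode_tokens_per_sec_upper_bound(bandwidth_gb_s, active_params_billions, bits_per_weight):
    """Rough ceiling on decode speed if every active weight is read once per token."""
    # Billions of parameters * bits / 8 gives gigabytes read per generated token.
    gb_read_per_token = active_params_billions * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Assumed inputs: 64 GB/s is the transfer rate quoted in the comment;
# 37B active parameters and 4 bits per weight are illustrative guesses.
pcie_bound = decode_tokens_per_sec_upper_bound(64, 37, 4)
print(f"{pcie_bound:.1f} tokens/s")  # prints "3.5 tokens/s"
```

Real throughput depends on caching, which experts stay resident in VRAM, and compute, so treat this as a ceiling for one simplified case rather than a prediction.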


refulgentis 470 days ago

Okay, let's stipulate LLMs are compute and bandwidth sensitive (of course!)...

#1, should highlight it up front this time: We are talking about _G_PUs :)

#2 You can't get a single consumer GPU that has enough memory to load a 670B parameter model, so there's some magic going on here. It's notable and distinct. This is probably due to FlashMoE, given its prominence in the link.

TL;DR: 1) these are Intel _G_PUs, and 2) it is a remarkable, distinct achievement to be loading a 670B parameter model on only one to two cards.


aurareturn 470 days ago

1) This system mostly uses normal DDR RAM, not GPU VRAM.

2) M3 Ultra can load DeepSeek R1 671B Q4.

Using a very large LLM across the CPU and GPU is not new. It's been done since the beginning of local LLMs.
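For a sense of scale on the VRAM vs. system RAM split, here is a rough size estimate. It is a sketch under assumptions: 4.5 bits per weight is a guess at the effective size of a Q4-style quantization including overhead, the 16GB per card comes from the A770 figure mentioned earlier in the thread, and KV cache and activations are ignored.

```python
def weights_size_gb(total_params_billions, bits_per_weight):
    """Approximate size of the quantized weights in gigabytes."""
    return total_params_billions * bits_per_weight / 8


# Assumptions: 671B parameters, ~4.5 effective bits per weight (illustrative),
# and two cards with 16 GB of VRAM each.
model_gb = weights_size_gb(671, 4.5)
two_cards_gb = 2 * 16
fraction_in_vram = two_cards_gb / model_gb

print(f"{model_gb:.0f} GB of weights, {fraction_in_vram:.0%} fits in VRAM")
# prints "377 GB of weights, 8% fits in VRAM"
```

Under these assumptions the bulk of the weights has to sit in system RAM, which is consistent with point 1 above.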


xoranth 470 days ago

> Crappy Pixel Fold 2022 mid-range Android CPU

Can you share which LLMs you run on such small devices and what use cases they address?

(Not a rhetorical question, it's just that I see a lot of work on local inference for edge devices with small models, but I could never get a small model to work for me. So I'm curious about other people's use cases.)


refulgentis 470 days ago

Excellent and accurate q. You sound like the first person I've talked to who might appreciate full exposition here, apologies if this is too much info. TL;DR is you're def not missing anything, and we're just beginning to turn a corner and see some rays of light of hope, where it's a genuine substitute for remote models in consumer applications.

#1) I put a lot of effort into this and, quite frankly, it paid off absolutely 0 until recently.

#2) The "this" in "I put a lot of effort into this" means: I left Google 1.5 years ago and have been quietly building an app that is LLM-agnostic, in service of coalescing a lot of next-gen thinking re: computing I saw, which A) is now possible due to LLMs and B) was shitcanned in 2020, because Android won politically and because all that next-gen thinking seemed impossible given it required a step change in AI capabilities.

This app is Telosnex (telosnex.com).

I have a couple of stringent requirements I enforce on myself: it has to run on every platform, and it has to support local LLMs just as well as paid ones.

I see that as essential for avoiding continued algorithmic capture of the means of info distribution, and believe that on a long enough timeline, all the rushed hacking people have done to llama.cpp to get model after model supported will give way to UX improvements.

You are completely, utterly, correct to note that the local models on device are, in my words, useless toys, at best. In practice, they kill your battery and barely work.

However, things did pay off recently. How?

#1) llama.cpp landed a significant opus of a PR by @ochafik that normalized tool handling across models, as well as implemented what the models need individually for formatting

#2) Phi-4 mini came out. Long story, but tl;dr: till now there have been various gaping flaws with each Phi release. This one looked free of any issues. So I hacked support for its tool vagaries on top of what @ochafik landed, and all of a sudden I'm seeing the first local model sub-Mixtral 8x7B that's reliably handling RAG flows (i.e. generate search query, then accept 2K tokens of parsed web pages and answer a q following directions I give you) and tool calls (i.e. generate search query, or file operations like here: https://x.com/jpohhhh/status/1897717300330926109)
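For anyone unfamiliar with the RAG flow described above, it can be sketched roughly like this. This is not Telosnex's or llama.cpp's code: `llm` and `search` are hypothetical callables you would supply yourself, the prompts are made up, and the "token" budget is approximated by counting whitespace-separated words rather than using a real tokenizer.

```python
def truncate_to_budget(text, max_tokens):
    """Crude budget: keeps the first max_tokens whitespace-separated words."""
    return " ".join(text.split()[:max_tokens])


def rag_answer(question, llm, search, max_context_tokens=2000):
    """Generate a search query, fetch pages, then answer from the trimmed context."""
    # Step 1: ask the model for a search query (hypothetical prompt wording).
    query = llm(f"Write a web search query for: {question}").strip()
    # Step 2: fetch parsed page text and trim it to roughly the 2K budget.
    pages = search(query)
    context = truncate_to_budget("\n\n".join(pages), max_context_tokens)
    # Step 3: ask the model to answer using only that context.
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)


# Stand-ins so the sketch runs without a model or network access.
def fake_llm(prompt):
    if prompt.startswith("Write a web search query"):
        return "stub query"
    return "stub answer"


def fake_search(query):
    return ["page one text", "page two text"]


answer = rag_answer("What is FlashMoE?", fake_llm, fake_search)
```

The point of the shape is that a small model has to get two separate things right: producing a usable query, and sticking to the supplied context when answering.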
