We haven't had phones running laptop-grade CPUs/GPUs for that long, and that is a very real hardware feat. Likewise, nobody would've said running a 400B LLM on a low-end laptop was feasible, and that is very much a software triumph.
We've had solid CPUs for a while, but GPUs have lagged behind (and they're the ones that matter for this particular application). iPhones still lead by a comfortable margin on GPU performance, but have historically been pretty limited on I/O (they only supported USB 2 speeds until recently).
The GPUs are perfectly solid. Cheap Android handsets have shipped with Vulkan compliance for almost a decade now; their GPUs have the same feature set as those in consoles and PCs. The same goes for Apple handsets, which run Metal compute shaders byte-identical to the ones on the Mac. They would be perfectly adequate for desktop use. The hardware lacks nothing required for inference or gaming that dGPUs ordinarily support.
And even if you raise the requirements, we still have to contend with cheap CUDA-capable GPUs like the one in the ($300!!!) Nintendo Switch, or the Jetson SoCs. The mobile market has had tons of high-speed/low-power options for a very long time now.
The iPhone 17 Pro launched 8 months ago with 50% more RAM and about double the inference performance of the previous iPhone Pro (also 10x prompt processing speed).
It wasn't considered impossible. There are examples of large MoE LLMs running on small hardware all over the internet, like giant models on Raspberry Pi 5.
It's just so slow that nobody pursued it seriously. It's fun to see these tricks implemented, but even on this 2025 top-spec iPhone Pro the output is 100x slower than output from hosted services.
If the bottleneck is storage bandwidth, that's not "slow". It's only slow if you insist on interactive speeds; the point of this is that you can run cheap inference in bulk on very low-end hardware.
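To put rough numbers on the storage-bandwidth argument, here is a back-of-the-envelope sketch. It assumes decode speed is limited purely by reading the active (routed) weights from storage for every token, which ignores compute, RAM bandwidth, and any overlap of I/O with compute. The function name and all figures in the usage note are made up for illustration; they are not measurements of any real phone or model.

```python
def storage_bound_tokens_per_second(active_params, bits_per_param,
                                    storage_gb_per_s, cache_hit_rate=0.0):
    """Rough upper bound on decode speed when storage reads are the bottleneck.

    Assumption: each token needs every active weight, and any weight that is
    not already resident in RAM (cache_hit_rate) must be read from storage.
    """
    if not 0.0 <= cache_hit_rate < 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1)")
    # Bytes that have to come off storage for a single token.
    bytes_per_token = active_params * bits_per_param / 8 * (1.0 - cache_hit_rate)
    return storage_gb_per_s * 1e9 / bytes_per_token


def tokens_per_day(tokens_per_second):
    """Bulk throughput if the device runs unattended for 24 hours."""
    return tokens_per_second * 86_400


# Hypothetical figures: 10B active parameters, 4-bit weights, 2.5 GB/s storage.
rate = storage_bound_tokens_per_second(10e9, 4, 2.5)
print(f"{rate:.2f} tok/s, about {tokens_per_day(rate):,.0f} tokens/day")
```

With those made-up inputs the bound works out to half a token per second: useless for chat, but tens of thousands of tokens per day for a batch job nobody is waiting on.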
You're simply pointing out that most people who use AI today expect interactive speeds. You're right that the point here is not raw power efficiency: having to read from storage will impact energy per operation, and datacenter-scale AI hardware beats edge hardware by that metric anyway. But the ability to repurpose cheaper, smaller-scale hardware is also compelling.
The real improvement will come when the software engineers get into the training loop. Then we can have MoE models with cache-friendly expert utilisation and maybe even learned prefetching for what the next experts will be.
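A toy simulation of why cache-friendly expert utilisation would matter: it compares the hit rate of a small LRU cache of experts under uniform routing versus skewed routing. Everything here is an assumption for illustration (64 experts, room for 16 in RAM, a Zipf-like skew, one expert lookup per step); real routers pick several experts per layer and their statistics will differ.

```python
import random
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Fraction of expert lookups served from an LRU cache of `capacity` experts."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for expert_id in accesses:
        total += 1
        if expert_id in cache:
            hits += 1
            cache.move_to_end(expert_id)   # mark as most recently used
        else:
            cache[expert_id] = True        # "load" the expert from storage
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / total if total else 0.0


def simulate(num_experts=64, capacity=16, steps=20_000, seed=0):
    """Return (uniform_hit_rate, skewed_hit_rate) for synthetic routing traces."""
    rng = random.Random(seed)
    ids = list(range(num_experts))
    # Uniform routing: every expert is equally likely at every step.
    uniform = rng.choices(ids, k=steps)
    # Skewed routing (hypothetical): expert at rank r is chosen with weight 1/(r+1).
    weights = [1.0 / (rank + 1) for rank in range(num_experts)]
    skewed = rng.choices(ids, weights=weights, k=steps)
    return lru_hit_rate(uniform, capacity), lru_hit_rate(skewed, capacity)


uniform_rate, skewed_rate = simulate()
print(f"uniform routing: {uniform_rate:.0%} hits, skewed routing: {skewed_rate:.0%} hits")
```

Under uniform routing the hit rate hovers around capacity / num_experts, so most lookups go to storage. A router trained to concentrate traffic on a hot set gets far more out of the same RAM, which is the kind of win that has to come from the training side.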
> maybe even learned prefetching for what the next experts will be
Experts are predicted by layer and the individual layer reads are quite small, so this is not really feasible. There's just not enough information to guide a prefetch.
I mean, by any reasonable standard it still is. Almost any computer can run an LLM, it's just a matter of how fast, and 0.4k/s (peak before first token) is not really considered running. It's a demo, but practically speaking it's entirely useless.
Devil's advocate: this actually shows how promising TinyML and EdgeML capabilities are. SoCs comparable to the A19 Pro are highly likely to be commodified in the next 3-5 years in the same manner that SoCs comparable to the A13 already are.
They didn't make special purpose hardware to run a model. They crafted a large model so that it could run on consumer hardware (a phone).