| HN Mirror


by anon373839 45 days ago

> This is not a local model for any reasonable definition of local

That's true for now. I am hopeful that once the hardware markets have recovered from OpenAI's sabotage, we will see more hardware dedicated to local inference that can handle these big models.

Also, I'm thinking about the unique MoE routing that Apple is using with their new Apple Foundation Model. The model is trained and architected so that experts are not swapped for every token, but only occasionally. This suggests that, e.g., a future 744B-parameter model could have its experts offloaded to SSD and still run with the effective compute requirements of a 40B model.
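
A rough back-of-the-envelope sketch of why the swap frequency matters for SSD offload. The 40B active-parameter figure is from the comment above; the 1 byte per parameter, 10 tokens/s, and 100-token swap interval are illustrative assumptions, and the function name is hypothetical. It pessimistically assumes the whole active set is re-read on every swap.

```python
def ssd_read_rate_gb_s(active_params_b, bytes_per_param, tokens_per_s, tokens_per_swap):
    """Average SSD read rate in GB/s.

    Assumes (pessimistically) that the whole active parameter set is
    re-read from SSD once every `tokens_per_swap` generated tokens.
    """
    gb_per_swap = active_params_b * bytes_per_param  # billions of params * bytes/param = GB
    swaps_per_s = tokens_per_s / tokens_per_swap
    return gb_per_swap * swaps_per_s


# Assumed numbers: 40B active params, 8-bit weights, 10 tokens/s.
every_token = ssd_read_rate_gb_s(40, 1, 10, tokens_per_swap=1)
occasional = ssd_read_rate_gb_s(40, 1, 10, tokens_per_swap=100)
print(f"swap every token:      {every_token:.0f} GB/s")
print(f"swap every 100 tokens: {occasional:.0f} GB/s")
```

Under these assumptions the required read rate drops by the same factor as the swap interval grows, which is what would make SSD offload plausible at all.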

3 comments

timschmidt 45 days ago

Reading weights out of memory is the definition of a large linear read. I'm a bit mystified that no one has put an embarrassingly parallel flash storage controller next to some tensor processors on a PCIe card. It could have 4 TB of flash hanging off enough channels to saturate SRAM, skipping DRAM entirely, and could even offload prompt processing to a GPU in the same workstation, so long as it got reasonable tokens/s in inference. I'd buy one tomorrow.
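
To put rough numbers on the "enough channels" idea, here is a minimal sketch that treats decoding as purely bandwidth-bound: every active weight is read once per token. All figures (40B active parameters, 1 byte per weight, 400 GB/s aggregate bandwidth, 2 GB/s per flash channel) are placeholder assumptions, not specs of any real part, and both function names are hypothetical.

```python
import math


def decode_tokens_per_s(read_bandwidth_gb_s, active_params_b, bytes_per_param):
    """Upper bound on decode speed when every active weight is read once per token."""
    gb_per_token = active_params_b * bytes_per_param
    return read_bandwidth_gb_s / gb_per_token


def flash_channels_needed(target_bandwidth_gb_s, per_channel_gb_s):
    """Number of parallel flash channels needed to reach the target bandwidth."""
    return math.ceil(target_bandwidth_gb_s / per_channel_gb_s)


# Assumed: 40B active params at 1 byte each, 400 GB/s aggregate flash read bandwidth.
print(decode_tokens_per_s(400, 40, 1))
# Assumed: 2 GB/s per channel (placeholder, not a real part's spec).
print(flash_channels_needed(400, 2))
```

This ignores compute time, latency, and KV-cache traffic, so it is only an upper bound on tokens/s.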


adrian_b 45 days ago

For the last year, several companies have been developing products that include HBF (high-bandwidth flash memory) as a supplement to HBM, in order to run inference for big LLMs at a reasonable cost, e.g. on one GPU-like card.

HBF was initially announced by SanDisk early in 2025; then, early this year, Hynix announced that it had joined SanDisk in producing HBF and that the common specification will be standardized under the Open Compute Project.

With HBF, it would be easy to make a GPU card with 4 TB of HBF, which could run the biggest existing open weights LLMs in their native unquantized form.
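
As a quick sanity check on the capacity claim, this sketch computes the weight footprint of a hypothetical 744B-parameter model stored at 2 bytes per parameter (BF16, an assumption about what "unquantized" means here) and compares it with the 4 TB figure. The function name is hypothetical, and KV cache and activations are ignored.

```python
def weights_footprint_tb(params_billions, bytes_per_param):
    """Weight storage in TB (decimal), ignoring KV cache and activations."""
    return params_billions * bytes_per_param / 1000


HBF_CAPACITY_TB = 4  # capacity figure from the comment above

footprint = weights_footprint_tb(744, 2)  # hypothetical 744B model in BF16
print(f"{footprint:.3f} TB of weights, fits in 4 TB: {footprint <= HBF_CAPACITY_TB}")
```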


timschmidt 45 days ago

Exciting news! This is how I see running frontier models at home becoming reasonably affordable. Though it may take a depreciation cycle or two.


zozbot234 45 days ago

For sparse MoE models, the individual expert layers that inference draws on are actually quite small - single-digit megabytes or so.
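
A minimal sketch of the size arithmetic, assuming each expert is a gated MLP with three bias-free weight matrices (gate, up, and down projections). The dimensions and the 16-bit weight width are hypothetical values chosen for illustration, not taken from any model named in this thread.

```python
def expert_size_mib(d_model, d_ff_expert, bytes_per_param):
    """Size of one gated-MLP expert (gate, up, and down projections) in MiB."""
    params = 3 * d_model * d_ff_expert
    return params * bytes_per_param / 2**20


# Hypothetical dimensions for a small-expert MoE, with 16-bit weights.
print(expert_size_mib(d_model=2048, d_ff_expert=768, bytes_per_param=2))
```

Models with wider experts would land well above this, so the single-digit figure depends on the architecture.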


tshaddox 45 days ago

Is there reason to expect the consumer hardware markets to recover any time soon?

Is there reason to expect they’ll ever recover without an AI bust that takes down the U.S. economy?


20after4 45 days ago

I don't think it'll ever recover. Partially perhaps. But we have bigger problems to worry about really.


zozbot234 45 days ago

Normally, experts are picked for every layer, not just every token. But there are plausible ways of getting around that bottleneck while streaming if you can batch many inferences together. Still, the Apple approach of swapping the experts only rarely is interesting, though it likely degrades the model a lot.
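
One way to see why batching helps: an expert streamed in for a layer can be reused by every token in the batch that routes to it. The sketch below estimates expert loads per token for a single layer, under the strong assumption that each token independently picks its top-k experts uniformly at random (real routers are not uniform). The expert count, top-k value, and function name are hypothetical.

```python
def expected_expert_loads_per_token(num_experts, top_k, batch_size):
    """Expected expert loads per token for one layer, amortized over a batch.

    Assumes each token independently picks `top_k` distinct experts
    uniformly at random, which real routers do not guarantee.
    """
    p_unused = (1 - top_k / num_experts) ** batch_size
    distinct_experts = num_experts * (1 - p_unused)
    return distinct_experts / batch_size


for batch in (1, 8, 64, 512):
    loads = expected_expert_loads_per_token(num_experts=128, top_k=8, batch_size=batch)
    print(f"batch {batch:>3}: {loads:.2f} expert loads per token per layer")
```

With a batch of one, every token pays for all of its top-k experts; as the batch grows, the per-token cost falls toward the total expert count divided by the batch size.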


FridgeSeal 45 days ago

Just get the bigger models to figure out the architecture required for hot-swappable sub-experts without loss of performance!

Got all those tokens, isn’t that the point of auto research and friends??

(Only sort of joking).
