by firstbabylonian 128 days ago

> SSD streaming to GPU

Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?

1: https://arxiv.org/abs/2312.11514

3 comments

simonw 128 days ago

Yes. I collected some details here: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/

anemll 128 days ago

Thanks for posting this, that's how I first found out about Dan's experiment! SSD speed doubled in the M5 Pro/Max generation, which makes it usable! I think one paper that's under the radar is "KV Prediction for Improved Time to First Token" (https://arxiv.org/abs/2410.08391), which hopefully can help with prefill for flash streaming.

Yukonv 128 days ago

That’s exactly what I thought about. I'm getting my hands on an M5 Max this week and am going to see how Dan’s experiment performs with faster I/O. I'm also going to experiment with running the active parameters at Q6 or Q8: since output is I/O bottlenecked, there should be room for higher-accuracy compute.

anemll 128 days ago

Check my repo; I added some support for GGUF/Unsloth at Q3, Q5, and Q8: https://github.com/Anemll/flash-moe/blob/iOS-App/docs/gguf-h...

3abiton 127 days ago

To be fair, it's "possible" to run such a setup with llama.cpp with SSD offload. The token generation (TG) speeds are just abysmal. But it's possible.

superjan 128 days ago

That was a very good summary. One detail the post could mention is that the 4 or 10 experts invoked were selected from the 512 experts the model has per layer (to give an idea of the savings).
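
To put rough numbers on that, here is a minimal sketch of the arithmetic, using only the figures from the comment above (4 or 10 experts invoked out of 512 per layer). The function name is made up for illustration, and the calculation deliberately ignores shared experts, attention weights, and anything else that has to stay resident, so it only describes the routed-expert portion of each layer.

```python
# Fraction of a layer's routed experts that a single token touches.
# 512, 4 and 10 come from the comment; everything else is illustrative.
TOTAL_EXPERTS_PER_LAYER = 512


def active_fraction(active: int, total: int = TOTAL_EXPERTS_PER_LAYER) -> float:
    """Share of one layer's routed experts used by one token."""
    return active / total


for k in (4, 10):
    share = active_fraction(k)
    print(f"{k} of {TOTAL_EXPERTS_PER_LAYER} experts: {share:.2%} of routed experts per layer")
# 4 of 512 experts: 0.78% of routed experts per layer
# 10 of 512 experts: 1.95% of routed experts per layer
```

So per token, only about 1 to 2 percent of each layer's routed experts have to be read, which is where the I/O savings come from.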

trebligdivad 127 days ago

I guess this is all set up to show off the new high-bandwidth-flash stuff that's due out soon?

zozbot234 128 days ago

A similar approach was recently featured here: https://news.ycombinator.com/item?id=47476422 Though iPhone Pro has very limited RAM (12GB total) which you still need for the active part of the model. (Unless you want to use Intel Optane wearout-resistant storage, but that was power hungry and thus unsuitable for a mobile device.)

Aurornis 128 days ago

> Though iPhone Pro has very limited RAM (12GB total) which you still need for the active part of the model.

This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.

zozbot234 128 days ago

Yes but most people are still running MoE models with all experts loaded in RAM! This experiment shows quite clearly that some experts are only rarely needed, so you do benefit from not caching every single expert-layer in RAM at all times.

Aurornis 128 days ago

That's not what this test shows. It's just loading the parts of the model that are used in an on-demand fashion from flash.

The iPhone 17 Pro only has 12GB of RAM. This is a 17B-active MoE model. Even quantized, you can only realistically fit one expert in RAM at a time. Maybe 2 with extreme quantization. It's just swapping them out constantly.

If some of the experts were unused then you could distill them away. This has been tried! You can find reduced MoE models that strip away some of the experts, though it's only a small number. Their output is not good. You really need all of the experts to get the model's quality.
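
As a rough illustration of what "loading the parts that are used on demand from flash" can look like, here is a minimal sketch. It assumes a hypothetical file layout in which every expert is a fixed-size blob stored back to back, so one expert can be read by seeking to its offset. The file name, the tiny `EXPERT_BYTES` size, and both helper functions are invented for the example; this is not the layout or API used by the experiment discussed in this thread.

```python
import os
import tempfile

EXPERT_BYTES = 1024  # hypothetical, tiny stand-in for a real expert's size


def write_expert_file(path, num_experts):
    """Write fixed-size blobs back to back; blob i is filled with byte i % 256."""
    with open(path, "wb") as f:
        for i in range(num_experts):
            f.write(bytes([i % 256]) * EXPERT_BYTES)


def load_expert(f, expert_id):
    """Read only one expert's bytes by seeking to its offset in the file."""
    f.seek(expert_id * EXPERT_BYTES)
    return f.read(EXPERT_BYTES)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "experts.bin")
    write_expert_file(path, num_experts=16)
    with open(path, "rb") as f:
        # Only expert 5 is read; the other 15 blobs never leave storage.
        blob = load_expert(f, expert_id=5)

print(len(blob), blob[0])  # 1024 5
```

With real weights the same idea applies at a much larger scale: the router picks the experts, and only those byte ranges are read for the current token.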

zozbot234 128 days ago

The writeup from the earlier experiment (running on a MacBook Pro) shows quite clearly that expert routing choices are far from uniform, and that some layer-experts are only used rarely. So you can save some RAM footprint even while swapping quite rarely.

Aurornis 128 days ago

I understand, but this isn't just a matter of not caching some experts. This is a 397B model on a device with 12GB of RAM. It's basically swapping experts out all the time, even if the distribution isn't uniform.

When the individual expert sizes are similar to the entire size of the RAM on the device, that's your only option.

QuantumNomad_ 128 days ago

If I only use an LLM to ask questions about programming in one specific programming language, can I distill away other experts and get all the answers I need from a single expert? Or is it still different experts that end up handling the question depending on what else is in the question? For example, if I say “plan a static web server in Rust” it might use expert A for that, but if I say “implement a guessing game in Rust” it might use expert B, and so on?

Snoozus 127 days ago

Unfortunately no, experts are typically switched out for every token. The way I understand it, the original idea was to have each expert be good at one kind of task, but that's not how it panned out after training.
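
The per-token switching described above can be sketched with a toy top-k router. This is a simplification under stated assumptions: the router scores here are random stand-ins rather than the output of a learned gating network, there is only one layer, and the 512 and 10 figures are borrowed from elsewhere in this thread. The function names are made up for illustration.

```python
import random


def top_k_experts(scores, k):
    """Return the indices of the k highest router scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def route_tokens(num_tokens, num_experts=512, k=10, seed=0):
    """Pick k experts for each token from random stand-in router scores."""
    rng = random.Random(seed)
    choices = []
    for _ in range(num_tokens):
        # A real model computes these scores from the token's hidden state.
        scores = [rng.random() for _ in range(num_experts)]
        choices.append(top_k_experts(scores, k))
    return choices


for t, experts in enumerate(route_tokens(num_tokens=5)):
    print(f"token {t}: experts {sorted(experts)}")
```

The point of the sketch is only that the selection is made again for every token (and, in a real model, again at every layer), so a prompt about one topic does not pin the model to a single expert.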

anemll 127 days ago

The 17B includes 10 experts plus one shared expert, so the actual size of each expert is much smaller.

jnovek 128 days ago

I’m so confused in these comments right now — I thought you had to load an entire MoE model and sparseness just made it so you can traverse the model more quickly.

MillionOClock 128 days ago

I hope some company trains their models so that expert switches are less often necessary just for these use cases.

zozbot234 128 days ago

A model "where expert switches are less necessary" is hard to tell apart from a model that just has fewer total experts. I'm not sure whether that will be a good approach. "How often to switch" also depends on how much excess RAM is available in the system to keep layers opportunistically cached from the previous token(s). There's no one-size-fits-all decision.
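
How much spare RAM helps can be explored with a small simulation. The sketch below is a toy under loud assumptions: a single layer, an LRU eviction policy, and a Zipf-like skew in how often each expert is picked. None of these are taken from the experiment itself; the `skew` parameter, the class, and the function are hypothetical, and real routing statistics would have to be measured.

```python
import random
from collections import OrderedDict


class ExpertCache:
    """LRU cache of expert ids for one layer; counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id):
        if expert_id in self.entries:
            self.entries.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1  # a real system would read the expert from flash here
            self.entries[expert_id] = True
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used


def simulate(capacity, num_experts=512, k=10, tokens=2000, skew=1.0, seed=0):
    """Return the cache hit rate for a skewed, randomly generated routing trace."""
    rng = random.Random(seed)
    # Assumed Zipf-like popularity: expert 0 is picked most often.
    weights = [1.0 / (rank + 1) ** skew for rank in range(num_experts)]
    cache = ExpertCache(capacity)
    for _ in range(tokens):
        picked = rng.choices(range(num_experts), weights=weights, k=k)
        # Drop duplicates so each token touches at most k distinct experts.
        for expert_id in dict.fromkeys(picked):
            cache.fetch(expert_id)
    return cache.hits / (cache.hits + cache.misses)


for capacity in (16, 64, 256, 512):
    print(f"cache of {capacity} experts: hit rate {simulate(capacity):.1%}")
```

Under these assumptions the hit rate rises with cache size, and how quickly it rises depends entirely on how skewed the routing really is.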

simonw 128 days ago

Yeah, this new post is a continuation of that work.

foobiekr 128 days ago

This is not entirely dissimilar to what Cerebras does with its weight streaming.

manmal 128 days ago

And IIRC the Unreal Engine Matrix demo for PS5 was streaming textures directly from SSD to the engine as well?

WatchDog 127 days ago

Yeah, also "RTX IO", and Microsoft "DirectStorage".

What was more interesting about the Unreal Engine demo was that it could stream not only textures but geometry too.

Virtual texturing had been around for a long time, but virtual geometry with Nanite is really interesting.
