| HN Mirror



	by zozbot234 37 days ago
	It could run viably with SSD offload on Macs with very little memory. You could even exploit batching to make the model almost compute limited even in that challenging setting, seeing as the KV cache is so extremely small (for non-humongous context). In fact, if that approach can be made to work I'd like to see a comparison between DS4 Flash and Pro on the same (Mac) hardware.


Havoc 37 days ago

>It could run viably with SSD offload on Macs with very little memory

Not really. That's going to land you somewhere in the 0.2-0.5 tokens per second range.

Lovely as modern NVMe drives are, they're not memory.


zozbot234 37 days ago

You can run multiple inferences in parallel on the same set of weights; that's what batching is. Given enough parallelization it can be almost entirely compute-limited, at least for small context (the KV cache is at most ~10GB per request apparently, but that's for 1M tokens!)
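
A rough back-of-the-envelope sketch of this batching argument, in Python. Every number below is a hypothetical placeholder (weight bytes streamed per decode step, SSD bandwidth, compute rate), not a measurement of any real model or Mac. The sketch also assumes that every request in the batch reuses the same streamed weights and that I/O overlaps perfectly with compute; both are optimistic, especially if requests end up needing different subsets of the weights.

```python
def aggregate_tokens_per_second(weight_bytes_per_step, ssd_bytes_per_second,
                                batch_size, compute_tokens_per_second):
    """Estimate total decode throughput across a batch with SSD-streamed weights.

    One decode step streams the weights once and produces one token per request.
    """
    # Time to stream the weights needed for one decode step from the SSD.
    io_seconds = weight_bytes_per_step / ssd_bytes_per_second
    # Time to compute one token for every request in the batch.
    compute_seconds = batch_size / compute_tokens_per_second
    # Assumption: I/O and compute overlap fully, so the slower one sets the pace.
    step_seconds = max(io_seconds, compute_seconds)
    return batch_size / step_seconds


def break_even_batch_size(weight_bytes_per_step, ssd_bytes_per_second,
                          compute_tokens_per_second):
    """Batch size at which compute time catches up with SSD streaming time."""
    io_seconds = weight_bytes_per_step / ssd_bytes_per_second
    return io_seconds * compute_tokens_per_second


# Hypothetical inputs, chosen only to illustrate the shape of the trade-off.
WEIGHT_BYTES_PER_STEP = 20e9   # bytes streamed per decode step (assumed)
SSD_BYTES_PER_SECOND = 5e9     # sustained SSD read bandwidth (assumed)
COMPUTE_TOKENS_PER_SECOND = 50.0  # compute-bound decode rate (assumed)

for batch in (1, 50, 200, 400):
    total = aggregate_tokens_per_second(
        WEIGHT_BYTES_PER_STEP, SSD_BYTES_PER_SECOND, batch, COMPUTE_TOKENS_PER_SECOND)
    print(f"batch={batch:4d}  total={total:6.2f} tok/s  per-request={total / batch:5.2f} tok/s")
```

Under these placeholder numbers a single request is bound by the SSD, while the total across the batch grows until compute becomes the limit. The per-request rate never improves, which is the trade-off both sides of this exchange are describing.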


happyPersonR 37 days ago

Yes, I think what this demonstrates, and what folks are missing, is that optimizing for specific scenarios is now quite possible.


Havoc 37 days ago

For offline work that's fine I guess, but batched or not, <1 tok/s is largely unusable for most use cases.


zozbot234 37 days ago

I just think this potential workflow needs to be tested so that we know if anything breaks or makes it infeasible. Ultimately it would be slow when running any single agent, but you might be working with a huge number of them in parallel. I view this as potentially a great way of repurposing low-RAM hardware with this specific model.
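
One way to sanity-check how many agents could run in parallel is to budget RAM for their KV caches. The sketch below takes the thread's "~10GB per request for 1M tokens" figure at face value and assumes the cache grows linearly with context length and that 1GB is 10^9 bytes. The RAM budget and per-agent context length are hypothetical placeholders.

```python
def kv_bytes_per_token(max_kv_bytes, max_context_tokens):
    """Per-token KV cache size, assuming linear growth with context length."""
    return max_kv_bytes / max_context_tokens


def max_parallel_requests(ram_budget_bytes, context_tokens, bytes_per_token):
    """How many requests' KV caches fit in the given RAM budget."""
    per_request_bytes = context_tokens * bytes_per_token
    return int(ram_budget_bytes // per_request_bytes)


# Figure quoted in the thread: ~10GB of KV cache at 1M tokens of context.
per_token = kv_bytes_per_token(10e9, 1_000_000)

# Hypothetical setup: 8GB of RAM left for KV caches, 8,000 tokens per agent.
agents = max_parallel_requests(8e9, 8_000, per_token)
print(f"{per_token:.0f} bytes/token, {agents} parallel agents")
```

This counts only KV cache memory. Activations, any weights kept resident, and the operating system would all reduce the real budget.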
