| HN Mirror


	by visarga 47 days ago
	Large LLMs on a MacBook produce tokens at an acceptable speed; the problem is reading context. Not incremental reading as in a chat session, where the KV cache helps, but reading a large input at once, like when you paste a big file. That can take minutes.
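
To make the distinction concrete, here is a toy sketch: with a KV cache, only tokens that are not already cached need a forward pass, so a chat turn is cheap while a fresh paste must be processed in full. The function name and the token counts are illustrative assumptions, not measurements.

```python
def tokens_to_prefill(cached_tokens: int, total_tokens: int) -> int:
    """Tokens that still need processing when a prefix is already in the KV cache."""
    if not 0 <= cached_tokens <= total_tokens:
        raise ValueError("cached_tokens must be between 0 and total_tokens")
    return total_tokens - cached_tokens


# Chat turn (hypothetical numbers): 8,000 tokens of history cached, 200 new tokens.
chat_turn = tokens_to_prefill(cached_tokens=8_000, total_tokens=8_200)

# Pasting a big file into a fresh session (hypothetical size): nothing is cached.
big_paste = tokens_to_prefill(cached_tokens=0, total_tokens=60_000)

print(chat_turn, big_paste)
```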

4 comments

antirez 47 days ago

DS4 can process 460 prompt tokens per second on an M3 Max. Not stellar, but not so slow. See the benchmarks in the README.
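
As a back-of-the-envelope sketch of what that rate means for large prompts: the only figure taken from this thread is 460 tokens per second. The prompt sizes are made up, and the estimate assumes throughput stays constant as the context grows, which may not hold in practice.

```python
# Prompt-processing rate quoted above for an M3 Max.
PROMPT_TOKENS_PER_SECOND = 460


def prefill_seconds(prompt_tokens: int,
                    tokens_per_second: float = PROMPT_TOKENS_PER_SECOND) -> float:
    """Estimated time to read a prompt, assuming constant throughput."""
    return prompt_tokens / tokens_per_second


# Hypothetical prompt sizes, chosen only for illustration.
for n in (1_000, 20_000, 100_000):
    s = prefill_seconds(n)
    print(f"{n:>7} tokens: {s:6.1f} s ({s / 60:.1f} min)")
```

Under these assumptions, a prompt of around 100,000 tokens takes a few minutes to read before the first output token appears.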

brcmthrowaway 47 days ago

Why is this the case?

Are there any architectures that don't rely on feeding the entire history back into the chat?

Recurrent LLMs?

bel8 47 days ago

And unless I'm mistaken, the repo is about running it with 2-bit quantization.

This is probably far from the raw intelligence provided by cloud providers.

Still, this shines more light on local LLMs for agentic workflows.

antirez 46 days ago

It runs both q2 and the original (4-bit routed experts), at more or less the same speed. The q2 quants are not what you might expect: they work extremely well, for a few reasons. For the full model you need a Mac with 256GB.

someone13 46 days ago

Out of curiosity, do you have any theories of why it works so well at such aggressive quantization levels?

antirez 46 days ago

It's a mix of extreme sparsity but with the routed expert doing a non trivial amount of work (and it is q8), and projections and routing not being quantized as well. Also the fact it's a QAT model must have a role I guess, and I quantized routed experts out layers with Q2 instead of IQ2_XXS to retain quality.

happyPersonR 46 days ago

Not trying to give anyone homework, just thinking out loud:

One thing I would love to see is if this dogfoods itself

Like, would dsv4 with q2 be able to do this task itself on this hardware?

Sidenote: I wish I had a M4-m3 … thinking about getting an ASUS ROG Flow Z13 Gaming Laptop (Model GZ302EA-XS99). It uses PCIe 4.0, so disk might be a little slower, but I want to see how this does on something like Vulkan :)

habosa 46 days ago

Can you ELI5 why this is so slow for local inference but so fast for using hosted models?
