HN Mirror



	by sigmoid10 352 days ago
	I hate to say it, but reasoning models simply aren't suited for edge computing. I just ran some tests on this model, and even at 4-bit weight quantisation it blows past 10 GB of VRAM after just ~1000 tokens, while it is still reasoning. So even if you're running on a dedicated ML edge device like a $250 Jetson, you will run out of memory before the model even formulates a real answer. You'll need a high-end GPU to make full use of it for limited answers and an enterprise-grade system to support longer contexts. And with reasoning turned off, I don't see any meaningful improvement over older models. So this is primarily great for enterprises that want to do on-prem with limited budgets, and maybe high-end enthusiasts.


wizee 352 days ago

You should use flash attention with KV cache quantization. I routinely use Qwen 3 14B with the full 128k context and it fits in under 24 GB VRAM. On my Pixel 8, I've successfully used Qwen 3 4B with 8K context (again with flash attention and KV cache quantization).
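
For a rough sense of why KV cache quantization matters at long context, here is a back-of-the-envelope sketch. The model shape (40 layers, 8 KV heads, head dimension 128) and the per-element storage costs are assumptions chosen for illustration, not figures from this thread, and the estimate ignores model weights, activations, and runtime overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Approximate KV cache size: one key and one value vector
    per layer, per KV head, per cached token position."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = keys + values
    return elems * bytes_per_elem


# ASSUMPTION: shape of a 14B-class model with grouped-query attention.
ASSUMED_SHAPE = dict(n_layers=40, n_kv_heads=8, head_dim=128)

# ASSUMPTION: storage cost per cached element. "q8" and "q4" model block
# formats that store 32 values plus one 2-byte scale per block.
BYTES_PER_ELEM = {"f16": 2.0, "q8": 34 / 32, "q4": 18 / 32}


def report(n_ctx):
    """Return the estimated KV cache size in GiB for each cache format."""
    return {
        name: kv_cache_bytes(n_ctx=n_ctx, bytes_per_elem=b, **ASSUMED_SHAPE) / 2**30
        for name, b in BYTES_PER_ELEM.items()
    }


if __name__ == "__main__":
    for n_ctx in (8 * 1024, 128 * 1024):
        sizes = ", ".join(f"{k}: {v:.2f} GiB" for k, v in report(n_ctx).items())
        print(f"{n_ctx} tokens -> {sizes}")
```

Under these assumptions, an unquantized fp16 cache at 128k tokens would take about 20 GiB by itself, while 8-bit and 4-bit caches would take roughly half and a quarter of that, which is what leaves room for the quantized weights on a 24 GB card.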


sigmoid10 351 days ago

>On my Pixel 8, I've successfully used Qwen 3 4B

How many tokens/s? I can't imagine that this would run in any practical way.
