HN Mirror



	by christina97 51 days ago
	I recently set up the 26B A4B model on vLLM on an RTX 3090 (4-bit) after a hiatus from local models. Just completely blown away by the speed and quality you can get now for a sub-$1k investment. I tried Qwen first, but it was unstable and had ridiculously long thinking traces!

4 comments

aimxhaisse 50 days ago

It even fits on a 3060 with turboquant / Q4 at a decent speed (40 T/s) for ~$200 (:


2ndorderthought 50 days ago

Some of the early quants for Qwen 3.6 were broken. It's still finicky, but with a little hand-holding it's crazy.

Local models are the future. It's awesome.


jszymborski 51 days ago

The A4B model is blazing fast and super good at general inquiries. It's notably worse than Qwen 3.6 for coding tasks, but that says more about the Qwen model.


maille 50 days ago

Bad at coding, but would it be good at code review?


avadodin 50 days ago

Good compared to what? Nothing? Probably better.


moffkalast 50 days ago

The 31B is surprisingly fast too, for a dense model. Token generation (tg) runs at least twice as fast as it ought to on my machine compared to other 30B models, probably due to the hybrid attention. Prompt ingestion is somewhat slower, though.
