

by scottjg 94 days ago

I very recently ran the numbers on these GPUs for an upcoming blog post. The token generation performance is bad, but the prefill performance is _really_ bad.

For a Qwen 3.6 35B / 3B MoE, 4-bit quant:

- parsing a 4k prompt on an M4 MacBook Air takes 17 seconds before generating a single token.

- on an M4 Max Mac Studio it's faster at 2.3 seconds

- on an RTX 5090, it's 142ms.

The RTX 5090 uses more power than an M4 Max Mac Studio, but not 16x more power.
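For anyone who wants to check the ratios, here is a rough sketch of the arithmetic behind the 16x figure. It assumes "4k" means 4096 tokens, which is my reading and not something stated above; the three timings are the ones quoted in the list, and the function names are just for illustration.

```python
# Assumption: a "4k prompt" is taken to be 4096 tokens.
PROMPT_TOKENS = 4096

# Prefill times in seconds, as quoted in the comment above.
prefill_seconds = {
    "M4 MacBook Air": 17.0,
    "M4 Max Mac Studio": 2.3,
    "RTX 5090": 0.142,
}


def prefill_rate(seconds, tokens=PROMPT_TOKENS):
    """Prompt-processing throughput in tokens per second."""
    return tokens / seconds


def slowdown_vs(baseline, device):
    """How many times longer `device` takes than `baseline` for the same prompt."""
    return prefill_seconds[device] / prefill_seconds[baseline]


for name, secs in prefill_seconds.items():
    rate = prefill_rate(secs)
    ratio = slowdown_vs("RTX 5090", name)
    print(f"{name}: {rate:,.0f} tok/s, {ratio:.1f}x the RTX 5090's prefill time")
```

The M4 Max Mac Studio works out to about 16.2x the RTX 5090's prefill time, and the M4 MacBook Air to roughly 120x.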

1 comment

bigyabai 93 days ago

That's just a 4k context too. At a realistic context window of 16-32k tokens, the comparison becomes downright unfair.
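To put rough numbers on that, here is a naive extrapolation of the 4k timings quoted in the parent comment. It assumes prefill time grows linearly with prompt length and that "4k" means 4096 tokens; both are my assumptions, not measurements. Attention cost grows faster than linearly, so the real waits are probably longer than these estimates.

```python
# Assumption: the quoted timings were measured at 4096 prompt tokens.
MEASURED_TOKENS = 4096

# Prefill times in seconds at 4k, as quoted in the parent comment.
measured_seconds = {
    "M4 MacBook Air": 17.0,
    "M4 Max Mac Studio": 2.3,
    "RTX 5090": 0.142,
}


def linear_estimate(seconds_at_4k, tokens):
    """Scale a 4k prefill time to another prompt length, assuming linear growth.

    This is a deliberately simple model, not a measurement.
    """
    return seconds_at_4k * tokens / MEASURED_TOKENS


for tokens in (16384, 32768):
    print(f"--- {tokens} prompt tokens (linear estimate) ---")
    for name, secs in measured_seconds.items():
        print(f"{name}: ~{linear_estimate(secs, tokens):.1f} s before the first token")
```

Under this model the ratio between machines stays the same, but the absolute wait is what hurts: about 68 s at 16k and 136 s at 32k on the M4 MacBook Air, versus roughly 0.6 s and 1.1 s on the RTX 5090.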
