by scottjg
46 days ago
I very recently ran the numbers on these GPUs for an upcoming blog post. The token generation performance is bad, but the prefill performance is _really_ bad.

For a Qwen 3.6 35B / 3B MoE, 4-bit quant:

- parsing a 4k prompt on an M4 MacBook Air takes 17 seconds before it generates a single token.
- on an M4 Max Mac Studio it's faster, at 2.3 seconds.
- on an RTX 5090, it's 142 ms.

The RTX 5090 uses more power than an M4 Max Mac Studio, but not 16x more power.
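For anyone who wants to check the arithmetic behind the 16x figure, here is a rough sketch that turns the three timings above into prefill throughput and relative speedups. It assumes "4k" means 4096 tokens, which the comment does not state, and the function names are just illustrative.

```python
# Rough arithmetic on the prefill timings quoted above.
# Assumption: "4k prompt" is taken as 4096 tokens; the comment does not say exactly.
PROMPT_TOKENS = 4096

# Time to process the prompt before the first generated token, in seconds.
prefill_seconds = {
    "M4 MacBook Air": 17.0,
    "M4 Max Mac Studio": 2.3,
    "RTX 5090": 0.142,
}


def prefill_tokens_per_second(seconds, prompt_tokens=PROMPT_TOKENS):
    """Prompt tokens processed per second during prefill."""
    return prompt_tokens / seconds


def speedup_over(baseline, other):
    """How many times faster `other` finishes prefill than `baseline`."""
    return prefill_seconds[baseline] / prefill_seconds[other]


if __name__ == "__main__":
    for name, secs in prefill_seconds.items():
        print(f"{name}: {prefill_tokens_per_second(secs):,.0f} tokens/s prefill")
    ratio = speedup_over("M4 Max Mac Studio", "RTX 5090")
    print(f"RTX 5090 vs M4 Max Mac Studio: {ratio:.1f}x faster prefill")
```

The last line is where the 16x comes from: 2.3 s divided by 0.142 s. The throughput numbers depend on the 4096-token assumption; the ratios do not.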