by wokwokwok 891 days ago
I’m not sure what to say; fast, responsive output is ideal, and the larger model is distinctly slower for me, particularly for long completions (2k tokens) when you’re using a restricted grammar like JSON output.

I’m using an M2, not an M3, though; maybe it’s better for you.

I was under the impression quantised models were generally slower too, but I’ve never dug into it (or particularly noticed a difference between q4/q5/q6).

If you find it fast enough to use then go for it~