I’m not sure what to say; fast, responsive output is ideal, and the larger model is distinctly slower for me, particularly for long completions (2k tokens) when you’re using a restricted grammar such as JSON output.
I’m using an M2 not an M3 though; maybe it’s better for you.
I was under the impression quantised models were generally slower too, but I’ve never dug into it (or particularly noticed a difference between q4/q5/q6).
If you find it fast enough to use then go for it~