by tarruda 891 days ago
> but the larger model is a lot slower.

I found the performance to be very acceptable for a 33B 4-bit model on an M3 Max with 36 GB of RAM (much faster than reading speed).

1 comment

I’m not sure what to say; responsive, fast output is ideal, and the larger model is distinctly slower for me, particularly for long completions (2k tokens) when using a restricted grammar such as JSON output.

I’m using an M2, not an M3, though; maybe it’s better for you.

I was under the impression that quantised models were generally slower too, but I’ve never dug into it (or particularly noticed a difference between q4/q5/q6).

If you find it fast enough to use then go for it~