by poorman
309 days ago
This article really goes into a lot of detail, which is nice. gpt-oss is just not good for agentic use in my observation. TL;DR: I'll save you a lot of time trying things out for yourself. If you are on a Mac with >=32 GB of RAM, download LM Studio and then the `qwen3-coder-30b-a3b-instruct-mlx@5bit` model. It uses ~20 GB of RAM, so a 32 GB machine is plenty. Set it up with opencode [1] and you're off to the races! It has great tool-calling ability; gpt-oss doesn't even come close in my observations. [1] https://opencode.ai/
…I struggle to comprehend how an odd quantization like 5-bit, which doesn't align with 8-bit boundaries, would not slow down inference. On one hand, the hardware doing the multiplications doesn't support vectors of 5-bit values, so they need repacking to 8 bits before multiplication. On the other hand, the weights can't be bulk-repacked to 8 bits once and for all in advance (otherwise they wouldn't fit in RAM, and in that case one would just use an 8-bit quantization anyway).
So repacking the 5-bit values into 8-bit vectors on the fly would require quite a lot of instructions per multiplication (way more than for 4-bit quantization, where the alignment simplifies things). I wonder how much, percentage-wise, that would impact inference performance.
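To make the repacking concrete, here is a minimal Python sketch. It assumes a simple little-endian layout in which 8 five-bit values share 5 bytes; that layout and the function names are made up for illustration and are not claimed to be what MLX or LM Studio actually use. It only shows why 5-bit values straddle byte boundaries while 4-bit values do not, and says nothing about real kernel performance.

```python
def pack_5bit(values):
    """Pack 5-bit integers (0..31) into bytes: 8 values per 5 bytes (40 bits)."""
    if len(values) % 8 != 0:
        raise ValueError("length must be a multiple of 8")
    out = bytearray()
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            if not 0 <= v < 32:
                raise ValueError("value out of 5-bit range")
            word |= v << (5 * j)  # value j occupies bits 5j .. 5j+4
        out += word.to_bytes(5, "little")
    return bytes(out)


def unpack_5bit(data):
    """Unpack 5-byte groups back into one integer per value (the 'repack to 8 bit' step)."""
    if len(data) % 5 != 0:
        raise ValueError("length must be a multiple of 5")
    values = []
    for i in range(0, len(data), 5):
        word = int.from_bytes(data[i:i + 5], "little")
        for j in range(8):
            # Most values cross a byte boundary, so each one needs its own shift.
            values.append((word >> (5 * j)) & 0x1F)
    return values


def unpack_4bit(data):
    """For comparison: 4-bit values never cross a byte boundary (low nibble first)."""
    values = []
    for b in data:
        values.append(b & 0x0F)
        values.append(b >> 4)
    return values


weights = [3, 31, 0, 17, 8, 1, 25, 12]
packed = pack_5bit(weights)
restored = unpack_5bit(packed)
```

In the 4-bit case every byte splits into two values with one mask and one shift, whereas the 5-bit case needs a different shift per position within each 40-bit group. How much that costs on real hardware depends on the kernel and on whether inference is memory-bound, which this sketch does not measure.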