Most benchmarks show very little improvement of "full quality" over a quantized lower-bit model. You can shrink the model to a fraction of its "full" size and get 92-95% same performance, with less VRAM use.
> How much VRAM does it take to get the 92-95% you are speaking of?
For inference, it's heavily dependent on the size of the weights (plus context). Quantizing an f32 or f16 model to q4/mxfp4 won't necessarily use 92-95% less VRAM, but it's pretty close for smaller contexts.
Thank you. Could you give a tl;dr on "the full model needs ____ this much VRAM and if you do _____ the most common quantization method it will run in ____ this much VRAM" rough estimate please?
Depending on what your usage requirements are, Mac Minis running UMA over RDMA is becoming a feasible option. At roughly 1/10 of the cost you're getting much much more than 1/10 the performance. (YMMV)
I did not expect this to be a limiting factor in the mac mini RDMA setup ! -
> Thermal throttling: Thunderbolt 5 cables get hot under sustained 15GB/s load. After 10 minutes, bandwidth drops to 12GB/s. After 20 minutes, 10GB/s. Your 5.36 tokens/sec becomes 4.1 tokens/sec. Active cooling on cables helps but you’re fighting physics.
Thermal throttling of network cables is a new thing to me…
I admire patience of anyone who runs dense models on unified memory. Personally, I would rather feed an entire programming book or code directory to a sparse model and get an answer in 30 seconds and then use cloud in rare cases it's not enough.
It's still very expensive compared to using the hosted models which are currently massively subsidised. Have to wonder what the fair market price for these hosted models will be after the free money dries up.