DeepSeek v4 Flash is still over 100GB at Q4 IIRC, and Q4 has generally been the sweet spot. Although it's an MoE so it might run a lot faster that this dense Mistral model if you have the RAM.
"Q4 has generally been the sweet spot" for self-hosting, yes. For any real meaningful work it's dumb AF. The only way to get reasonable intelligence from mid-size Gemma or Qwen is to run full precision BF16. Anything else is just an emulation of AI.
I would disagree. I have 8 GB of VRAM and 32 GB of RAM. I can either run a 4B BF16 dense model fully on GPU at around 30 t/s or Qwen3.6 35B A3B Q5_K_M at 20 t/s with GPU offload. Which one would I choose?