Hacker News
by pbgcp2026 48 days ago
"Q4 has generally been the sweet spot" for self-hosting, yes. For any real meaningful work it's dumb AF. The only way to get reasonable intelligence from mid-size Gemma or Qwen is to run full precision BF16. Anything else is just an emulation of AI.
1 comment

I would disagree. I have 8 GB of VRAM and 32 GB of RAM. I can either run a 4B BF16 dense model fully on GPU at around 30 t/s or Qwen3.6 35B A3B Q5_K_M at 20 t/s with GPU offload. Which one would I choose?
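
For a rough sense of the sizes behind that comparison, here is a back-of-the-envelope sketch. It assumes Q5_K_M averages about 5.5 bits per weight (an approximation, the real figure varies by model), that "A3B" means roughly 3B parameters are active per token, and it ignores KV cache, activations, and runtime overhead. The function name `weight_gib` is made up for illustration.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

# Assumption: ~5.5 bits per weight for Q5_K_M (approximate, not an exact figure).
Q5_K_M_BITS = 5.5

dense_bf16 = weight_gib(4, 16)            # 4B dense model at BF16
moe_total = weight_gib(35, Q5_K_M_BITS)   # all 35B weights at Q5_K_M
moe_active = weight_gib(3, Q5_K_M_BITS)   # ~3B weights touched per token (assumed)

print(f"4B BF16 weights:         {dense_bf16:.2f} GiB")
print(f"35B Q5_K_M weights:      {moe_total:.2f} GiB")
print(f"35B A3B active / token:  {moe_active:.2f} GiB")
```

Under these assumptions the 35B model does not fit in 8 GB of VRAM but does fit across VRAM plus 32 GB of RAM, and only a small slice of it is read per token, which is consistent with it staying usable at 20 t/s with offload.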