| Python coding is practically the only use case for local models for me. Cloud LLMs can run 1 trillion parameters and have all of Python knowledge in a transparent RAG that's 100 Gbit or faster, so of course they'll be the best on the block. But the new GPT coding benchmarks are only barely behind Grok 4 or GPT-5 with high reasoning. >Model(s) & size: exact name/version, and quantization (e.g., Q4_K_M). My most reliable setup is Devstral + OpenHands: Unsloth Q6_K_XL, 85,000 context, flash attention, K cache and V cache quantized at Q8. Second most reliable: GPT-OSS-20B + opencode. Default MXFP4. I can only load up to 31,000 context or it fails (still plenty, but I'm hoping this bug gets fixed), and you can't use flash attention or K/V cache quantization or it becomes dumb as rocks. This Harmony stuff is annoying. Still preliminary, I just got it working today, but testing is really good: Qwen3-30b-a3b-thinking-2507 + Roo Code or Qwen Code, 80,000 context, Unsloth Q4_K_XL, flash attention, K cache and V cache quantized at Q8. >Runtime/tooling: e.g., Ollama, LM studio, etc. LM Studio. I need Vulkan for my setup. ROCm is just a pain in the ass; they need to support way more Linux distros. 24 GB VRAM. |
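
For anyone wondering why quantizing the K and V caches to Q8 matters at 80,000+ context on a 24 GB card, here is a rough back-of-the-envelope sketch. The layer count, KV head count, and head dimension are assumptions picked for illustration (roughly what a ~24B grouped-query-attention model might use), not values read from any of the models above, and the estimate ignores model weights and compute buffers. The `kv_cache_bytes` helper is hypothetical, not part of any runtime.

```python
# Rough KV-cache size estimate for a transformer with grouped-query attention.
# ASSUMPTION: the architecture numbers used below are illustrative only;
# check your model's metadata for the real values.

# llama.cpp's Q8_0 format stores 32 int8 values plus one fp16 scale per block.
Q8_0_BYTES_PER_ELEMENT = 34 / 32
F16_BYTES_PER_ELEMENT = 2.0


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_element):
    """Bytes needed to hold keys and values for every layer at full context."""
    # The factor of 2 accounts for storing both K and V.
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return int(round(elements * bytes_per_element))


def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / (1024 ** 3)


if __name__ == "__main__":
    # Hypothetical architecture (assumed, not taken from a real model file).
    arch = dict(n_layers=40, n_kv_heads=8, head_dim=128)
    for label, bpe in [("f16", F16_BYTES_PER_ELEMENT), ("q8_0", Q8_0_BYTES_PER_ELEMENT)]:
        size = kv_cache_bytes(context_len=85_000, bytes_per_element=bpe, **arch)
        print(f"{label} KV cache at 85,000 context: {to_gib(size):.2f} GiB")
```

Under these assumed numbers the Q8 cache is a bit over half the size of the f16 one, which is the difference between the context fitting next to the weights or not. Swap in your own model's values before trusting the output.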