| HN Mirror


by incomingpain 311 days ago

Python coding is practically the only use case for local models for me.

Cloud LLMs can run 1 trillion parameters and have all of Python knowledge in a transparent RAG that's 100 Gbit or faster. Of course they'll be the bestest on the block.

But the new GPT coding benchmarks are only barely behind Grok 4 or GPT-5 with high reasoning.

>Model(s) & size: exact name/version, and quantization (e.g., Q4_K_M).

My most reliable setup is Devstral + OpenHands: Unsloth Q6_K_XL, 85,000 context, flash attention, K cache and V cache quantized at Q8.

Second most reliable: GPT-OSS-20B + opencode. Default MXFP4. I can only load up to 31,000 context or it fails (still plenty, but I'm hoping this bug gets fixed). You can't use flash attention or K or V cache quantization or it becomes dumb as rocks. This Harmony stuff is annoying.

Still preliminary, since I just got it working today, but testing is going really well: Qwen3-30b-a3b-thinking-2507 + Roo Code or Qwen Code, 80,000 context, Unsloth Q4_K_XL, flash attention, K cache and V cache quantized at Q8.
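For anyone wondering why the Q8 cache quant matters at these context lengths on a 24 GB card, here is a rough sketch of the usual KV cache size estimate (2 x layers x KV heads x head dim x context x bytes per element). The layer and head counts below are hypothetical placeholders, not read from any of the models above, and the estimate ignores sliding-window attention and runtime overhead. The q8_0 figure assumes the llama.cpp-style layout of 34 bytes per block of 32 values.

```python
# Bytes per cached element for two cache types.
# q8_0 is assumed to store 32 values in a 34-byte block
# (32 int8 values plus one fp16 scale), i.e. 1.0625 bytes per value.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, cache_type="f16"):
    """Estimate KV cache size in bytes for a plain (non-sliding-window) transformer."""
    # Each token stores one K vector and one V vector per layer per KV head.
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim
    return elements_per_token * context_len * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    shape = dict(n_layers=40, n_kv_heads=8, head_dim=128)
    for cache_type in ("f16", "q8_0"):
        size = kv_cache_bytes(context_len=85_000, cache_type=cache_type, **shape)
        print(f"{cache_type}: {size / 2**30:.1f} GiB")
```

Under these assumptions the quantized cache comes out at a bit over half the f16 size, which is the difference between a long context fitting next to the weights or not.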

>Runtime/tooling: e.g., Ollama, LM studio, etc.

LM Studio. I need Vulkan for my setup. ROCm is just a pain in the ass. They need to support way more Linux distros.

24 GB VRAM.


briansun 310 days ago

Super useful config dump—thanks. Do you have wall‑clock numbers for prefill/gen tokens/sec and power draw on the 24GB card for those three setups? Also curious where quality starts to degrade vs. context length in your tests.
