|
|
|
|
|
by roosgit
127 days ago
|
|
Have you tried other local models? The 14B Q4_K_M needs 9GB, but Q3_K_M is 7.3GB. But you also need some room for context. Still, maybe using `--override-tensor` in llama.cpp would get you a 50% improvement over "naively" offloading layers to the GPU. Or possibly GPT-OSS-20B. It's 12.1GB in MXFP4, but itβs a MOE model so only a part of it would need to be on the GPU. On my dedicated 12GB 3060 it runs at 85 t/s, with a smallish context. I've also read on Reddit some claims that Qwen3 4B 2507 might be better than 8B, because Qwen never released a "2507" update for 8B. |
|
I've been on Qwen3 8B mostly because it was "good enough" for the mechanical stages (scanning, scoring, dedup) and I didn't want to optimize the local model before validating the orchestration pattern itself. Now that the pipeline is proven, experimenting with the local model is the obvious next lever to pull.
The Qwen3 4B 2507 claim is interesting β if the quality holds for structured extraction tasks, halving the VRAM footprint would open up running two models concurrently or leaving more room for larger contexts. Worth testing.
Thanks for the pointers β this is exactly the kind of optimization I haven't had time to dig into yet.