Show HN: Best setup local LLM found for a 5090 (llama.cpp fork + turboquant) | HN Mirror


Show HN: Best setup local LLM found for a 5090 (llama.cpp fork + turboquant) (local-llm.utop.workers.dev)

2 points by utopman 53 days ago

Hi folks, I found a setup on consumer hardware that seems to give great results for local inference:

- Qwen 3.6 at Q6
- 450K context using a llama.cpp fork with turboquant in turbo3 mode
- multimodal support

This AI-generated blog article is a kind of "report" of what I did, how I did it, and some example results.

I hope this can be useful to some people.

Note: I am not much interested in having success with this article; I mainly want to share what I think is an interesting use of a 5090. I generated the blog page by telling the AI to comply with the HN "rules" and remain factual.

It's definitely not perfect: it was done rather quickly and not properly tested beyond 265K context. Please forgive my laziness :). I am just enthusiastic right now about what can be done on a 5090.

1 comments

CobaltFire 53 days ago

Given the hedges made in your statement here, and the extremely questionable choice to trade a Q4 model with a less quantized cache for a Q6 model with a Q3 cache, I think it is safe to say this does not fit the title.

The Qwen3.6-35B model has, in my testing, been decent but not nearly as good as the Qwen3.6-27B. Running that with a less quantized cache is going to be "better" for anyone using it for software dev in my limited testing.
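To put rough numbers on the trade being discussed here (weight quantization versus KV cache quantization inside one fixed VRAM budget), here is a minimal back-of-the-envelope sketch. Everything in it is an assumption for illustration: the layer count, KV head count and head dimension are hypothetical placeholders and not the real Qwen3.6 architecture, the bits-per-weight figures are approximate averages for the GGUF quant types, and treating a "Q3" or "Q8" cache as exactly 3 or 8 bits per value ignores block scales and whatever the fork actually stores.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bits_per_value: float) -> float:
    """Approximate KV cache size in bytes for a plain transformer.

    Each token stores one K and one V vector (hence the factor 2) of
    n_kv_heads * head_dim values in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bits_per_value / 8


if __name__ == "__main__":
    # HYPOTHETICAL architecture, used only to compare the two options.
    arch = dict(n_layers=48, n_kv_heads=8, head_dim=128)
    n_params = 35e9  # nominal parameter count taken from the model name

    options = [
        # label, approx bits per weight, assumed cache bits, context length
        ("Q4 weights + 8-bit cache", 4.85, 8, 150_000),
        ("Q6 weights + 3-bit cache", 6.56, 3, 450_000),
    ]
    for label, w_bits, c_bits, n_ctx in options:
        w = weight_bytes(n_params, w_bits)
        kv = kv_cache_bytes(n_ctx=n_ctx, bits_per_value=c_bits, **arch)
        print(f"{label}: weights {w / GIB:.1f} GiB, "
              f"cache {kv / GIB:.1f} GiB, total {(w + kv) / GIB:.1f} GiB")
```

The point is only the shape of the trade: weight size scales with parameter count times bits per weight, cache size scales with context length times bits per cached value, and both compete for the same memory. Models that do not use full attention in every layer will need less cache than this formula suggests.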

utopman 52 days ago

You are absolutely right. My title is very bad. I'll update it to a much less absolute statement. Sorry for that.

I am now trying to find the sweet spot with the 27B model + turbo8, so I guess I should have plenty of quality context left.

The mistake I made in my tests was stopping at a working configuration that maximised my hardware use, without running real, in-depth software tests. The one-shot 3D app I generated with the previous setup shows exactly this: I did not try my setup on real software development cases.

So thank you for the guidance. I am not new to agentic coding, but when it comes to a proper setup with a deep understanding of the real trade-offs in inference engines, I need a deeper understanding to make better decisions.

The 27B Q6_K with turbo8 at ~150K context should give me a real improvement on this stack. It's test party time :D
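As a sanity check on a context target like this one, the question can be turned around: given a VRAM budget and a weight size, how many tokens of cache fit? The sketch below is only an estimate under loud assumptions: the architecture numbers are hypothetical placeholders (not the real 27B model), "turbo8" is treated as exactly 8 bits per cached value, Q6_K is taken as roughly 6.56 bits per weight, and the 2 GiB overhead for compute buffers and the multimodal projector is a guess.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bits_per_value: float) -> float:
    """Bytes of KV cache one token needs (K and V, across all layers)."""
    return 2 * n_layers * n_kv_heads * head_dim * bits_per_value / 8


def max_context_tokens(vram_bytes: float, weights_bytes: float,
                       overhead_bytes: float, per_token_bytes: float) -> int:
    """Largest context whose KV cache fits in the memory left after weights."""
    free = vram_bytes - weights_bytes - overhead_bytes
    if free <= 0:
        return 0
    return int(free // per_token_bytes)


if __name__ == "__main__":
    vram = 32 * GIB                # RTX 5090 memory size
    weights = 27e9 * 6.56 / 8      # nominal 27B params at ~6.56 bits/weight
    overhead = 2 * GIB             # GUESS: compute buffers, projector, etc.
    # HYPOTHETICAL architecture values, for illustration only.
    per_token = kv_bytes_per_token(n_layers=64, n_kv_heads=4, head_dim=128,
                                   bits_per_value=8)
    print("estimated max context:",
          max_context_tokens(vram, weights, overhead, per_token), "tokens")
```

Swapping in the real layer and head counts from the model's GGUF metadata, and the memory figures llama.cpp reports at load time, would make this estimate meaningful. As written it only shows how the budget is split.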

Edit: oops, I also found that I can no longer edit my boldly wrong title :/