> This model? You can run it at Q4 with 70GB of VRAM. This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).

The one thing I would want everyone curious about local LLMs to know is that being able to run a model and being able to run a model fast are two very different thresholds.

You can get these models to run on a 128GB Mac, but we need to first tell if Q4 retains enough quality (models have different sensitivities to quantization) and how fast it runs. For running async work and background tasks the prompt processing and token generation speeds matter less, but a lot of Mac Studio buyers have discovered the hard way that it's not going to be as responsive as working with a model hosted in the cloud on proper hardware.

For most people without hard requirements for on-site processing, the best use case for this model would be going through one of the OpenRouter hosted providers for it and paying by token.

> This beats the latest Sonnet while running locally

Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.
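For anyone who wants to sanity-check the "fits in N GB" side of this, here is a rough back-of-the-envelope sketch. The parameter count and bits-per-weight in the example are hypothetical placeholders, not figures for the model being discussed, and the function only covers the weights (no KV cache or runtime overhead).

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return n_params_billion * bits_per_weight / 8


def fits(n_params_billion: float, bits_per_weight: float,
         available_gb: float, overhead_gb: float = 0.0) -> bool:
    """True if weights plus an assumed overhead fit in the available memory."""
    needed = weight_memory_gb(n_params_billion, bits_per_weight) + overhead_gb
    return needed <= available_gb


if __name__ == "__main__":
    # Hypothetical model: 120B parameters, quantized to an average of
    # 4.5 bits per weight. Both numbers are assumptions for illustration.
    print(weight_memory_gb(120, 4.5))
    # overhead_gb is a guess at KV cache and runtime needs, not a measurement.
    print(fits(120, 4.5, available_gb=128, overhead_gb=20))
```

Note that this only answers the first threshold (does it load at all). It says nothing about prompt processing or token generation speed, which has to be measured on the actual hardware.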
This has been my experience as well. I've been testing an agent built with Strands Agents that receives a load balancer latency alert and is expected to query logs with AWS Athena (Trino), then drill down into Datadog spans/traces to find the root cause. Admittedly, "devops" domain knowledge is important here.
My notes so far:
"us.anthropic.claude-sonnet-4-6" # working, good results
"us.anthropic.claude-sonnet-4-20250514-v1:0" # has problems following the prompt instructions
"us.anthropic.claude-sonnet-4-5-20250929-v1:0" # working, good results
"us.anthropic.claude-opus-4-5-20251101-v1:0"
"us.anthropic.claude-opus-4-6-v1" # best results, slower, more expensive
"amazon.nova-pro-v1:0" # completely fails
"openai.gpt-oss-120b-1:0" # tool calling broken
"zai.glm-5" # seems to work pretty well, a little slow, more expensive than Sonnet
"minimax.minimax-m2.5" # didn't diagnose correctly
"zai.glm-4.7" # good results but high tool call count, more expensive than Sonnet
"mistral.mistral-large-3-675b-instruct" # misdiagnosed--somehow claimed a Prometheus scrape issue was involved
"moonshotai.kimi-k2.5" # identified the right endpoints but interpreted trace data/root cause incorrectly
"moonshot.kimi-k2-thinking" # identified endpoint, 1 correct root cause, 1 missing index hallucination
All of these were run on AWS Bedrock. I let Claude Code w/ Opus 4.7 iterate over the agent prompt but didn't try to optimize it per model. Really the only thing that came close to Sonnet 4.5 was GLM-5. The real kicker is that Sonnet is also the cheapest, since it supports prompt caching.
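To show why caching can outweigh a lower list price on a tool-heavy agent run, here is a rough cost sketch. All prices, token counts, and the cache-read multiplier below are made-up placeholders rather than actual Bedrock pricing, and the sketch ignores any cache-write surcharge.

```python
def run_cost_usd(input_tokens: int, cached_tokens: int, output_tokens: int,
                 in_price: float, out_price: float,
                 cache_read_multiplier: float = 1.0) -> float:
    """Estimate the cost of one agent run.

    Prices are USD per million tokens. cached_tokens is the part of
    input_tokens served from the prompt cache and billed at
    in_price * cache_read_multiplier (1.0 means no caching discount).
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    uncached = input_tokens - cached_tokens
    total = (uncached * in_price
             + cached_tokens * in_price * cache_read_multiplier
             + output_tokens * out_price)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical agent run: many tool calls resend the same long prefix,
    # so most input tokens are cache hits. All numbers are assumptions.
    with_cache = run_cost_usd(1_000_000, 900_000, 20_000,
                              in_price=3.0, out_price=15.0,
                              cache_read_multiplier=0.1)
    no_cache = run_cost_usd(1_000_000, 0, 20_000,
                            in_price=3.0, out_price=15.0)
    print(f"with caching: ${with_cache:.2f}, without: ${no_cache:.2f}")
```

The point of the sketch is that in a loop with a high tool call count, input tokens dominate, so a model with a cheaper list price but no caching can still end up costing more per run.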
The Kimi ones were close to working but didn't quite hit the mark.