Hacker News new | ask | show | jobs
by nijave 58 days ago
>Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.

This has been my experience as well. I've been testing an agent built with Strands Agents which receives a load balancer latency alert and is expected to query logs with AWS Athena (Trino) then drill down with Datadog spans/traces to find the root cause. Admittedly, "devops" domain knowledge is important here

My notes so far:

"us.anthropic.claude-sonnet-4-6" # working, good results

"us.anthropic.claude-sonnet-4-20250514-v1:0" # has problems following the prompt instructions

"us.anthropic.claude-sonnet-4-5-20250929-v1:0" # working, good results

"us.anthropic.claude-opus-4-5-20251101-v1:0"

"us.anthropic.claude-opus-4-6-v1" # best results, slower, more expensive

"amazon.nova-pro-v1:0" # completely fails

"openai.gpt-oss-120b-1:0" # tool calling broken

"zai.glm-5" # seems to work pretty well, a little slow, more expensive than Sonnet

"minimax.minimax-m2.5" # didn't diagnose correctly

"zai.glm-4.7" # good results but high tool call count, more expensive than Sonnet

"mistral.mistral-large-3-675b-instruct" # misdiagnosed--somehow claimed a Prometheus scrape issue was involved

"moonshotai.kimi-k2.5" # identified the right endpoints but interpreted trace data/root cause incorrectly

"moonshot.kimi-k2-thinking" # identified endpoint, 1 correct root cause, 1 missing index hallucination

Using models on AWS Bedrock. I let Claude Code w/ Opus 4.7 iterate over the agent prompt but didn't try to optimize per model. Really the only thing that came close to Sonnet 4.5 was GLM-5. The real kicker is, Sonnet is also the cheapest since it supports prompt caching

The Kimi ones were close to working but didn't quite make the mark

1 comments

" it supports prompt caching" May I ask if you checked that? I use "{"cachePoint": { "type": "default" }" and I found 2 things: * 1) even if stated in the Doco, Bedrock Converse API does not allow 1hr expiry time, only 5m - gives error when attempted; * 2) Bedrock Converse API does accept up to 4 cachePoint's but does NOT cache and returns zeroes. LOL. It was confirmed by some other people on Github. (Note: VertexAI does cache properly reducing the bill drastically, so I use Vertex instead of OpenRouter.)
I had Claude Code pull the OTEL trace and calculate cost based on token counts in the responses. I'll double check later today tho if I remember

Edit: I do see the first request shows 0 cache read, 7k cache write tokens. The next request shows 7k cache read, 900 cache write tokens. The agent run summary is:

usage {

cache_read_input_tokens 244586

cache_write_input_tokens 38399

completion_tokens 8131

input_tokens 1172

output_tokens 8131

prompt_tokens 1172

total_tokens 292288

}

I do see a recent issue in the Strands Agent issue tracker about 1hr TTL getting ignored and defaulting to 5m TTL. I haven't validated cache TTL but these agent runs take ~2-3m so a 5m TTL is sufficient.

I also checked the AWS bill and see separate Usage SKUs

USE1-MP:USE1_CacheWriteInputTokenCount-Units $0.34

USE1-MP:USE1_OutputTokenCount-Units $0.27

USE1-MP:USE1_CacheReadInputTokenCount-Units $0.16

USE1-MP:USE1_InputTokenCount-Units $0.01