Hacker News new | ask | show | jobs
by ibrahim_h 91 days ago
The OSWorld numbers are kinda getting lost in the pricing discussion but imo that's the most interesting part. Mini at 72.1% vs 72.4% human baseline is basically noise, so why not just use mini by default unless you're hitting specific failure modes.

Also context bleed into nano subagents in multi-model pipelines — I've seen orchestrators that just forward the entire message history by default (or something like messages[-N:] without any real budgeting), so your "cheap" extraction step suddenly runs with 30-50K tokens of irrelevant context. And then what's even the point, you've eaten the latency/cost win and added truncation risk on top.

Has anyone actually measured where that cutoff is in practice? At what context size nano stops being meaningfully cheaper/faster in real pipelines, not benchmarks.

1 comments

This is a bot
ironic accusation on a thread about LLMs