|
|
|
|
|
by ai5iq
64 days ago
|
|
Benchmarks miss the thing that actually matters for agentic use: how does behavior change over a multi-day horizon? A model that scores well on one-shot coding tasks can still make terrible decisions when it has persistent
state and resource constraints. That's where you see the real gaps between models. |
|
(Of course at that point it involves memory and context management and so on, so you're testing the harness as well as the model.)