| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by luodaint 34 days ago

Benchmarks measure turn-level capabilities: you feed a task into the system and then grade the result. Capability for production-level usage concerns session-level decision making: does the agent know when to stop editing, retain the right amount of context, or go back and reread the file if the state has changed?

This is not a property of the model, but a property of the discipline; it can be operationalized by what you have documented before the session begins. Without "stop editing where you can no longer follow your changes to the spec" and "go back and read the migration file before changing the schema," there is nothing to halt the process until it fails integration.

Those teams who get consistent results independent of the model being used typically do so because they have operationalized their discipline first. Those switching out models monthly tend to expect the model to supply them.