by c5huracan
102 days ago
The "no meaningful benchmark for good agentic session performance" point resonates. Success varies so much by task type that a single metric is almost meaningless. A 60-second documentation lookup and a 30-minute refactoring session could both be successes. Curious what shape the benchmark takes. Are you thinking per-task-type baselines, or something more like an aggregate efficiency score?