	by c5huracan 102 days ago
	The "no meaningful benchmark for good agentic session performance" point resonates. Success varies so much by task type that a single metric is almost meaningless. A 60-second documentation lookup and a 30-minute refactoring session could both be successes. Curious what shape the benchmark takes. Are you thinking per-task-type baselines, or something more like an aggregate efficiency score?