| HN Mirror

	by zachdotai 58 days ago
	I wrote about this recently here: https://fabraix.com/blog/adversarial-cost-to-exploit. I think the core issue is static benchmarks. The community needs to move beyond measuring pass/fail (which worked when agents were incapable of doing much of the work) toward dynamic evals that more closely simulate how we evaluate humans.