by zachdotai
58 days ago
I wrote about this recently here:
https://fabraix.com/blog/adversarial-cost-to-exploit

I think the core issue is static benchmarks. The community needs to move beyond measuring pass/fail (which worked when agents were incapable of doing much of the work) toward dynamic evals that more closely simulate how we evaluate humans.