Hacker News
by sanderjd 1 day ago
Aren't there benchmarks that measure at the harness level as well?
1 comment

How would you benchmark "agent harness communicates with user clearly"? It's 100% a feels measurement.
I mean, in my experience some of this stuff is way closer to table stakes than that. More like "the tool call didn't get totally confused" than "did the communication with the user feel good".