| HN Mirror

	by sanderjd 1 day ago
	Aren't there benchmarks that measure at the harness level as well?

1 comments

theshrike79 1 day ago

How would you benchmark "agent harness communicates with user clearly"? It's 100% a feels measurement.


sanderjd 1 day ago

I mean, in my experience some of this stuff is way closer to table stakes than that. It's more "the tool call didn't get totally confused" than "did the communication with the user feel good".
