by theshrike79 7 days ago
We shouldn't just measure the power of the raw LLM; harnesses matter more and more.

It's like taking the engine out of each car, putting it on a test bed, running it, and then deciding whether the car is good or bad based on the graphs the test bed produced.

You might have the best engine in the world, but if you put it in a shit car, the result is still bad. The seats are squeaky plastic, the infotainment is touch-only and you can't put on your seatbelt without knocking down whatever is in the cupholder.

1 comments

Aren't there benchmarks that measure at the harness level as well?
How would you benchmark "agent harness communicates with user clearly"? It's 100% a feels measurement.
I mean, in my experience some of this stuff is way closer to table stakes than that. It's more "the tool call didn't get totally confused" than "did the communication with the user feel good".