by satisfice 108 days ago
It’s called testing. And from the reports and comments, there doesn’t seem to be much of it happening. The reason is: it’s quite expensive to do well.

I find that for every hypothesis I might have to run a thousand prompts to collect enough data for a conclusion. For instance, to discover how reliably different models can extract noun phrases from a text: hours of grinding. Even so, that was for a small text. I haven’t yet run the process on a large text.
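To make the "thousand prompts" point concrete, here is a rough sketch of the bookkeeping involved: run the same extraction many times, count exact matches against a hand-checked answer, and put a Wilson score interval around the pass rate. Everything named here (`measure_reliability`, the `extract` callable, the sample text) is hypothetical and only for illustration; `extract` stands in for whatever call actually sends the prompt to a model.

```python
import math
from typing import Callable, Iterable


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% confidence)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def measure_reliability(
    extract: Callable[[str], Iterable[str]],  # hypothetical: wraps the real model call
    text: str,
    expected: set[str],
    trials: int,
) -> tuple[int, tuple[float, float]]:
    """Run the same prompt `trials` times and count exact-match extractions."""
    successes = sum(1 for _ in range(trials) if set(extract(text)) == expected)
    return successes, wilson_interval(successes, trials)


if __name__ == "__main__":
    # Stand-in extractor that always answers correctly; swap in a real model call.
    def always_right(text: str) -> list[str]:
        return ["the quick fox", "the lazy dog"]

    passed, (low, high) = measure_reliability(
        always_right,
        "The quick fox jumps over the lazy dog.",
        {"the quick fox", "the lazy dog"},
        trials=50,
    )
    print(f"{passed}/50 passed, interval {low:.3f} to {high:.3f}")
```

The interval is why the run counts get so large: at 900 passes out of 1000 it spans roughly 88% to 92%, and it widens quickly as the number of trials drops, so telling two models apart by a few points needs runs on that order.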

2 comments

Seems like you are testing LLMs' generic abilities rather than your actual agent logic.

LLMs are like vendor code: you don't need to test them yourself. People have already created benchmarks for that.
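A minimal sketch of what testing the agent logic separately might look like, assuming a hypothetical agent step called `choose_action` that takes the model as a plain callable. The labels and action names are invented for the example; the point is only that a canned reply replaces the model, so the surrounding code is tested deterministically.

```python
from typing import Callable

# Hypothetical mapping from model label to the action our own code takes.
ACTIONS = {
    "REFUND": "open_refund_form",
    "BUG": "file_bug_report",
    "OTHER": "send_faq",
}


def choose_action(llm: Callable[[str], str], user_message: str) -> str:
    """Hypothetical agent step: ask the model for a label, then apply our own rules."""
    raw = llm(f"Classify as REFUND, BUG, or OTHER: {user_message}")
    label = raw.strip().upper()
    # Our logic, not the model's: anything outside the known labels goes to a human.
    return ACTIONS.get(label, "escalate_to_human")


def fake_llm(reply: str) -> Callable[[str], str]:
    """Stub that ignores the prompt and returns a canned reply."""
    return lambda prompt: reply


if __name__ == "__main__":
    print(choose_action(fake_llm(" refund\n"), "I want my money back"))
    print(choose_action(fake_llm("I think this is a refund?"), "I want my money back"))
```

This covers only the code around the model, such as parsing and fallbacks. It says nothing about how often the real model returns a usable label.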

No they haven’t. The benchmarks suck, because they are cheap knockoffs instead of comprehensive experiments.

LLMs are poorly tested by vendors. They literally can’t afford to test them, so they force us to do it.

Yeah, it's a super tedious process, and I was hoping that _maybe_ there is a tool out there that can help with this.