by aszen 100 days ago
Seems like you are testing LLMs' generic abilities rather than your actual agent logic.

LLMs are like vendor code: you don't need to test them yourself. People have already created benchmarks for that.

1 comment

No, they haven’t. The benchmarks suck because they are cheap knockoffs instead of comprehensive experiments.

LLMs are poorly tested by vendors. They literally can’t afford to test them, so they force us to do it.