by djohnston 739 days ago
> It will break in weird ways that makes you feel like you are trying to nail Jello to a tree

Probably the best description of working with LLM agents I've read

2 comments

It gets more interesting when you start benchmarking your prompts for accuracy. If you don't have an evaluation set, you are flying blind: any model update or small prompt fix could break edge cases without you noticing.
We benchmark against our own eval sets, which makes it easier to measure the run-to-run variance that I’ve found impossible to eliminate.
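For anyone wondering what that looks like in practice, here is a rough sketch, not anyone's actual setup. It scores a model against a small eval set several times and reports the mean accuracy plus the spread between runs. Everything here is made up for illustration: `make_flaky_model` is a stand-in for a real LLM call, `EVAL_SET` is toy data, and exact string match is just the simplest possible scoring rule.

```python
import random
import statistics
from typing import Callable, List, Tuple

# Hypothetical eval set format: (prompt, expected answer) pairs.
EvalSet = List[Tuple[str, str]]


def accuracy(model: Callable[[str], str], eval_set: EvalSet) -> float:
    """Fraction of eval cases where the model output exactly matches the expected answer."""
    correct = sum(1 for prompt, expected in eval_set if model(prompt).strip() == expected)
    return correct / len(eval_set)


def benchmark(model: Callable[[str], str], eval_set: EvalSet, runs: int = 5) -> Tuple[float, float]:
    """Run the eval set `runs` times; return (mean accuracy, population std dev across runs)."""
    scores = [accuracy(model, eval_set) for _ in range(runs)]
    return statistics.mean(scores), statistics.pstdev(scores)


def make_flaky_model(seed: int, error_rate: float) -> Callable[[str], str]:
    """Stand-in for a nondeterministic LLM (assumption): answers wrongly with probability error_rate."""
    rng = random.Random(seed)
    answers = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}

    def model(prompt: str) -> str:
        if rng.random() < error_rate:
            return "unknown"
        return answers.get(prompt, "unknown")

    return model


# Toy eval set, for illustration only.
EVAL_SET: EvalSet = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]

mean_acc, spread = benchmark(make_flaky_model(seed=0, error_rate=0.2), EVAL_SET, runs=10)
print(f"mean accuracy={mean_acc:.2f}, std dev across runs={spread:.2f}")
```

Re-running this after every prompt or model change and comparing the mean against the previous result (taking the spread into account) is one way to catch the silent edge-case regressions mentioned above.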
Make sure you don’t upload that evaluation set to any service that resells data (or gets scraped) for LLM training!
Came here to say the same thing. It sums it up perfectly.