HN Mirror



	by djohnston 739 days ago
	> It will break in weird ways that makes you feel like you are trying to nail Jello to a tree

	Probably the best description of working with LLM agents I've read.

2 comments

visarga 739 days ago

It gets more interesting when you get to benchmarking your prompts for accuracy. If you don't have an evaluation set, you are flying blind. Any model update or small fix could break edge cases without you knowing.


djohnston 739 days ago

We benchmark against our own eval sets, which makes it easier to measure the variance that I’ve found impossible to eliminate.
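
For anyone wondering what that can look like in practice, here is a minimal sketch, not anyone's actual setup. It scores a model on a small eval set and repeats the run to estimate the spread in accuracy. Everything in it is an assumption made for illustration: the `EVAL_SET` contents, the exact-match scoring, and `make_flaky_model`, which is a hypothetical stand-in for a real LLM call.

```python
import random
import statistics
from typing import Callable, List, Tuple

# Hypothetical eval set: (prompt, expected answer) pairs.
EVAL_SET: List[Tuple[str, str]] = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("3*3", "9"),
]

def accuracy(model: Callable[[str], str], eval_set: List[Tuple[str, str]]) -> float:
    """Fraction of prompts whose output exactly matches the expected answer."""
    correct = sum(model(prompt).strip() == expected for prompt, expected in eval_set)
    return correct / len(eval_set)

def benchmark(model: Callable[[str], str],
              eval_set: List[Tuple[str, str]],
              runs: int = 5) -> Tuple[float, float]:
    """Run the eval set several times; return (mean accuracy, sample std dev)."""
    scores = [accuracy(model, eval_set) for _ in range(runs)]
    spread = statistics.stdev(scores) if runs > 1 else 0.0
    return statistics.mean(scores), spread

def make_flaky_model(seed: int, error_rate: float = 0.2) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: answers correctly except at random."""
    rng = random.Random(seed)
    answers = dict(EVAL_SET)

    def model(prompt: str) -> str:
        if rng.random() < error_rate:
            return "???"  # simulated wrong answer
        return answers[prompt]

    return model

if __name__ == "__main__":
    mean, spread = benchmark(make_flaky_model(seed=0), EVAL_SET, runs=10)
    print(f"accuracy: {mean:.2f} +/- {spread:.2f}")
```

Rerunning the same script after a prompt tweak or model update and comparing the mean against the earlier spread gives a rough sense of whether a change is a real regression or just noise.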


amluto 739 days ago

Make sure you don’t upload that evaluation set to any service that resells data (or gets scraped) for LLM training!


barrell 739 days ago

Came here to say the same thing. It sums it up perfectly.
