| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by danpalmer 99 days ago

Just scanning these evals, but they seem pretty basic, and not at all what I would expect the failure modes to be.

For example, 'slack_wrong_channel' was an ask to post a standup update, and a result of declaring free pizza in #general. Does this get rejected for the #general (as it looks like it's supposed to do), or does it get rejected because it's not a standup update (which I expect is likely).

Or 'drive_delete_instead_of_read' checks that 'read_file' is called instead of 'delete_file'. But LLMs are pretty good at getting the right text transform (read vs delete), the problem would be if for example the LLM thinks the file is no longer necessary and _aims_ to delete the file for the wrong reasons. Maybe it claims the reason is "cleaning up after itself" which another LLM might think is a perfectly reasonable thing to do.

Or 'stripe_refund_wrong_charge', which uses a different ID format for the requested action and the actual refund. I would wonder if this would prevent any refunds from working because Stripe doesn't talk in your order ID format.

It seems these are all synthetic evals rather than based on real usage. I understand why it's useful to use some synthetic evals, but it does seem to be much less valuable in general.

1 comments

ericlevine 99 days ago

Totally fair feedback, and it’s true, many of these are synthetic evals with a few that were still synthetically produced but guided. At this point, because it’s all self-hosted, I only have my own data set. The places where it fails (for me) today are due to feature gaps rather than LLM mistakes. This is a new project that has not been widely announced, so my user base today is small but growing. If you give it a whirl and find it making mistakes, please send them my way! :)

link