by stephantul 34 days ago
It was directed at the parent who implied that we didn’t think about this.

I agree with your point about the evals and how you can get discontinuities: good search can be worse than bad search when agents can do many searches. We’re working on it.

1 comment

When you share them, please also share the setup so people can easily rerun them. Nearly every eval I've seen shares the LLM session transcript but not the actual harness setup that was used.