by yorwba 57 days ago
Why 45 times in particular? If you want 80% power to distinguish a model at 50% from a model at 51%, you need 39,440 samples per model, or 329 samples per question per model. But that would just give you a more precise estimate of how well the model does on those 120 questions in particular. If you want a more precise estimate of how well the model might do on future questions you come up with, you'll need to test more questions, not just test the same question more times.
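
For anyone who wants to check figures like these, here is a rough sketch of one way to arrive at them: the classical two-proportion sample size formula with a continuity correction, assuming a two-sided test at p < 0.05. The choice of formula and the function name `samples_per_model` are assumptions made for illustration; other calculators use slightly different approximations and will give slightly different numbers.

```python
from math import ceil, sqrt
from statistics import NormalDist


def samples_per_model(p1, p2, alpha=0.05, power=0.80):
    """Per-model sample size for a two-sided two-proportion z-test.

    Uses the normal approximation plus a continuity correction
    (an assumed method, chosen for illustration).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    delta = abs(p2 - p1)
    p_bar = (p1 + p2) / 2
    # Uncorrected normal-approximation sample size.
    n = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
         + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / delta ** 2
    # Continuity correction inflates the uncorrected size slightly.
    n_corrected = n / 4 * (1 + sqrt(1 + 4 / (n * delta))) ** 2
    return ceil(n_corrected)


n = samples_per_model(0.50, 0.51)
# Total per model, then spread evenly over the 120 questions.
print(n, ceil(n / 120))
```

Changing `p1` and `p2` shows how quickly the requirement falls as the gap widens.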

I made flame charts of Sonnet's thinking [0]. You can see there is a lot of variance across 5 runs. They all passed, but one struggled with errors. How many trials are needed to clamp to ceiling or floor? ~30?

[0] https://adamsohn.com/lambda-variance/

How many samples you need depends on the difference you want to be able to measure (0% to 1% is different from 50% to 51% is different from 0% to 10% is different from 50% to 60%), the significance level at which you will declare a difference (conventionally, p < 0.05) and how likely you want this to happen when there is indeed such a difference (statistical power, conventionally 80%). Of course you can also just sample an arbitrary number of times and compute confidence intervals after the fact, but doing a statistical power computation helps clarify what it is you want to know, how certain you want to be, and whether you can realistically achieve such knowledge with the budget you have.
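
As a sketch of the "compute confidence intervals after the fact" route, here is a Wilson score interval for a pass rate. The Wilson interval is just one reasonable choice for small samples, and `wilson_interval` is a hypothetical helper name, not something taken from the linked write-up.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(passes, runs, confidence=0.95):
    """Wilson score interval for a pass rate observed as passes/runs."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = passes / runs
    denom = 1 + z * z / runs
    center = (p_hat + z * z / (2 * runs)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / runs
                    + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


for runs in (5, 30):
    low, high = wilson_interval(runs, runs)  # every run passed
    print(f"{runs}/{runs} passes: pass rate between {low:.2f} and {high:.2f}")
```

Under these assumptions, 5 passes out of 5 only bounds the true pass rate above roughly 0.57, and 30 out of 30 bounds it above roughly 0.89.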
To solve the lambda calculus problem, Sonnet burned between 8,163 and 17,334 tokens across 5 runs.

If I want to engineer a prompt, starting from the tokens that are clearly better in the 8,163-token run should yield a better agent.

If I build an agent that does something arbitrary, like reverse engineering any website or multiplying 2 large numbers without a tool that lets it run code, the mechanics of the reasoning work the same as for an agent solving lambda calculus. Running 39,440 trials is prohibitively expensive. Nonetheless, even without rigorous proof, I want to say that running an agent several times and then taking any generalizable output from the fastest runs yields a much faster general agent for that specific task across different parameters.

That is something I really want to know. If I have an agent that reverse engineers websites, can I take the thinking output from the best run and use it to seed a better agent? I don't know how to set up the experiment. Asking ChatGPT has been futile, and running it is very expensive. How do I set up that experiment?

You could try a sequential testing setup, which can let you stop the experiment earlier if the difference is larger than expected. But if the difference is small, there's no way around the fact that reliably detecting small differences requires large sample sizes, and the relationship is inverse quadratic (halving the smallest detectable difference quadruples the sample size you need).
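
To make the sequential idea concrete, here is a minimal sketch of Wald's sequential probability ratio test (SPRT) on a stream of pass/fail outcomes. It assumes a simplified one-sample setting: the baseline agent's pass rate `p0` is treated as known, and `p1` is the improved rate you hope the seeded agent reaches. Comparing two agents head to head would need a paired or two-sample variant. The function name `sprt` and its return labels are made up for this example.

```python
from math import log


def sprt(outcomes, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT on pass (1) / fail (0) outcomes.

    Tests H0: pass rate = p0 against H1: pass rate = p1, with p1 > p0.
    Returns (decision, number of outcomes consumed).
    """
    upper = log((1 - beta) / alpha)   # at or above this: accept H1
    lower = log(beta / (1 - alpha))   # at or below this: accept H0
    llr = 0.0                         # running log-likelihood ratio
    used = 0
    for used, passed in enumerate(outcomes, start=1):
        llr += log(p1 / p0) if passed else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", used
        if llr <= lower:
            return "accept_h0", used
    return "undecided", used          # budget ran out before a decision


# An agent that passes every time is flagged as better after few runs.
print(sprt([1] * 100, p0=0.5, p1=0.6))
```

The test stops as soon as the evidence crosses either threshold, so a large real difference ends the experiment early, while a small one can exhaust the budget and come back "undecided".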