by dataviz1000
48 days ago
> there was 1 run per prompt per arm

My understanding is that there was only 1 run per configuration? If that is correct, then because of the run-to-run variability it really doesn't say much. It will take several trials per prompt per arm before the results look like they are stabilizing on a plot. That is prohibitively expensive, so I've been running the same prompt on the same model 5 times to get a visual understanding of performance. Someone did the same with lambda calculus yesterday.

I wanted to make the point about how much run-to-run variability and difference in cost there is with the same prompt and the same model, even with only 5 trials. I classified each of the thinking steps using Opus 4.6 (costs ~$4 in tokens per run just for that) and plotted them with custom flame graphs. [0]

When the run-to-run variability is between 8,163 and 17,334 tokens, none of these tests mean that much.

[0] https://adamsohn.com/lambda-variance/
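For anyone who wants to put a number on that spread, here is a minimal sketch in Python. The function name `spread_summary` and its output keys are made up for illustration, and the example input uses only the two extremes quoted above (8,163 and 17,334 tokens), not the full set of five runs, so treat the mean and standard deviation there as placeholders for what you would get with real per-run counts.

```python
import statistics


def spread_summary(token_counts):
    """Summarize run-to-run spread for repeated runs of one prompt on one model.

    token_counts: per-run token totals (hypothetical input format).
    """
    if len(token_counts) < 2:
        raise ValueError("need at least 2 runs to measure variability")
    lo, hi = min(token_counts), max(token_counts)
    return {
        "runs": len(token_counts),
        "min": lo,
        "max": hi,
        "max_over_min": hi / lo,  # how many times larger the worst run is
        "mean": statistics.mean(token_counts),
        "stdev": statistics.stdev(token_counts),  # sample standard deviation
    }


if __name__ == "__main__":
    # Only the two extremes reported in the comment; the other runs are not listed here.
    summary = spread_summary([8163, 17334])
    for key, value in summary.items():
        print(f"{key}: {value}")
```

With just those two extremes, the largest run uses a bit more than twice the tokens of the smallest, which is the scale of noise a single run per arm has to be read against.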