by Reubend
64 days ago
Because the website doesn't seem to report how many runs were performed, I assume they ran the suite once. The models are nondeterministic, so it's normal for different runs to give different results. I don't see this as evidence that Opus 4.6 has gotten worse.
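To put rough numbers on the single-run point, here is a minimal sketch. It assumes each task in the suite is an independent pass/fail trial with the same pass probability, which is a simplification, and the scores and task counts are made up for illustration, not taken from the site. The function names are hypothetical.

```python
import math


def single_run_std_error(pass_rate: float, num_tasks: int) -> float:
    """Standard error of a pass rate measured from one run over num_tasks tasks.

    Assumes independent pass/fail tasks (binomial model).
    """
    return math.sqrt(pass_rate * (1.0 - pass_rate) / num_tasks)


def difference_is_within_noise(score_a: float, score_b: float,
                               num_tasks: int, z: float = 1.96) -> bool:
    """Check whether two single-run scores differ by less than ~95% sampling noise.

    Uses a pooled pass rate and a normal approximation; both runs are assumed
    to cover the same number of tasks.
    """
    pooled = (score_a + score_b) / 2.0
    se_diff = math.sqrt(2.0 * pooled * (1.0 - pooled) / num_tasks)
    return abs(score_a - score_b) <= z * se_diff


if __name__ == "__main__":
    # Hypothetical numbers: a 5-point drop on a 100-task suite.
    print(single_run_std_error(0.70, 100))
    print(difference_is_within_noise(0.70, 0.65, 100))
```

Under those assumptions, a drop of a few points between two single runs on a small suite is hard to distinguish from run-to-run variation; a larger suite or repeated runs would narrow that band.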
|
And how is that an excuse?
I don't care about how good a model could be. I care about how good a model was on my run.
Consequently, my opinion of a model is going to be based on its worst performance, not its best.
As such, this qualifies as strong evidence that Opus 4.6 has gotten worse.