|
|
|
|
|
by Aurornis
146 days ago
|
|
Non-determinism isn’t the same as degradation. The non-determinism means that even with a temperature of 0.0, you can’t expect the outputs to be the same across API calls. In practice people tend to index to the best results they’ve experienced and view anything else as degradation. In practice it may just be randomness in either direction from the prompts. When you’re getting good results you assume it’s normal. When things feel off you think something abnormal is happening. Rerun the exact same prompts and context with temperature 0 and you might get a different result. |
|
To say that this measurement is bad because the server might just be overloaded completely misses the point. The point is to see if the model sometimes silently performs worse. If I get a response from "Opus", I want a response from Opus. Or at least want to be told that I'm getting slightly-dumber-Opus this hour because the server load is too much.