by shrumm
346 days ago
The ‘with evidence’ part is key, as simonw said.
One anecdote from evals at Cleric: it’s rare to see a new model do better on our evals than the current one. The reality is that you’ll optimize prompts etc. for the current model. So if a new model does only marginally worse, that’s a strong signal that the new model is indeed better for our use case.
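
As a rough sketch of that heuristic (the function name, the pass-rate metric, and the 0.05 margin are all illustrative assumptions, not Cleric's actual setup):

```python
def compare_models(current_results, new_results, margin=0.05):
    """Classify a new model against the current one on the same eval cases.

    Each argument is a list of booleans (True = case passed).
    `margin` is a hypothetical tolerance, not a recommended value.
    """
    if not current_results or len(current_results) != len(new_results):
        raise ValueError("both models must be run on the same non-empty eval set")

    current_rate = sum(current_results) / len(current_results)
    new_rate = sum(new_results) / len(new_results)
    gap = current_rate - new_rate  # positive means the new model scored lower

    if gap <= 0:
        return "better_or_equal"
    if gap <= margin:
        # Prompts were tuned for the current model, so a small deficit is
        # read as a sign the new model may win once prompts are re-tuned.
        return "promising"
    return "worse"
```

A "promising" result here only means the new model is worth re-tuning prompts for and re-running the evals. It is not evidence of an improvement by itself.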