by pants2
198 days ago
The best benchmark is one that you build for your use-case. I finally did that for a project, and I was not expecting the results. Frontier models are generally "good enough" for most use-cases, but if you have something specific you're optimizing for, there's probably a more obscure model that just does a better job.
|
There's a new model seemingly every week, so finding a way to evaluate them repeatedly would be nice.
The answer may be that it's so bespoke you have to hand-roll it every time, but my gut says there's a set of best practices that are generally applicable.
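
For what it's worth, the reusable part might be as small as this: a rough sketch, not anyone's actual setup. It assumes each model can be wrapped as a plain function from prompt to output and that each test case carries its own pass/fail check. `Case`, `run_benchmark`, and the two stub "models" are made-up names for illustration; a real version would swap the stubs for actual API calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    """One use-case-specific test: a prompt plus a pass/fail check."""
    prompt: str
    check: Callable[[str], bool]  # True if the model output is acceptable


def run_benchmark(
    models: Dict[str, Callable[[str], str]],
    cases: List[Case],
) -> Dict[str, float]:
    """Return the fraction of cases each model passes."""
    if not cases:
        raise ValueError("need at least one case")
    scores = {}
    for name, generate in models.items():
        passed = sum(1 for case in cases if case.check(generate(case.prompt)))
        scores[name] = passed / len(cases)
    return scores


# Hypothetical stand-ins for real models, so the sketch runs offline.
def stub_upper(prompt: str) -> str:
    return prompt.upper()


def stub_echo(prompt: str) -> str:
    return prompt


CASES = [
    Case("hello", lambda out: out == "HELLO"),
    Case("abc", lambda out: out.isupper()),
]

SCORES = run_benchmark({"stub_upper": stub_upper, "stub_echo": stub_echo}, CASES)
print(SCORES)
```

The cases stay bespoke, but the loop doesn't: when next week's model shows up, you add one more entry to the `models` dict and rerun.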