by reckless
329 days ago
The aggregate picture only tells you so much. Sites like simonwillison.net/2025/jul/ and channels like https://www.youtube.com/@aiexplained-official also cover new model releases pretty quickly, with some "out of the box thinking/reasoning" evaluations. For me and my usage, I can only really tell once I start using the new model for the tasks I actually use models for. My personal benchmark, andrew.ginns.uk/merbench, has full code and data on GitHub if you want a starting point!