Hacker News
by
avbanks
488 days ago
I still find 3.5 Sonnet the best for my coding tasks (better than o1, o3-mini, and R1). The makers of the other models might be trying to game the system by fine-tuning them for the benchmarks.
1 comment
czk
487 days ago
Would love to know just how overfit a lot of them are to these benchmarks.