Hacker News
by irthomasthomas, 336 days ago
Likely they trained on the test set. Grok 3 had similarly remarkable benchmark scores but fell flat in real use.