Hacker News
by campbel 842 days ago
Opus got it right for me. It seems the models give both correct and incorrect responses on this one. I don't think asking one question one time tells you much about a model's actual capability.
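To put a rough number on that, here is a minimal sketch that assumes you rerun the same prompt several times and count the correct answers. It uses the standard Wilson score interval for a proportion. The function name and the tallies are hypothetical, made up for illustration, not measurements of any model.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate (z = 1.96)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical tallies for illustration only.
one_shot = wilson_interval(1, 1)    # one question, asked once, answered correctly
repeated = wilson_interval(14, 20)  # same question asked 20 times, 14 correct

print("1 of 1:  ", one_shot)
print("14 of 20:", repeated)
```

With one correct answer out of one try, the interval runs from roughly 0.21 to 1.0, so a single pass is consistent with a model that gets it right almost always or only about one time in five. Repeating the question narrows that range.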