Hacker News
by ilaksh 1114 days ago
Incredibly, they seem to have used several different LLMs, yet made no distinction between the particular AI models used in the analysis. Amazing that they would not realize there is a huge difference in capabilities.

They also did not seem to consider how much performance varies from one prompt to another.