Hacker News
by PeterisP 1119 days ago
I would distrust the currently available benchmarks. Recent research (gah, can't remember the paper title) indicates that for many benchmarks, at least some of the data splits have leaked into model training data. There's also some experience with open source models that match an OpenAI model on benchmark scores but subjectively feel much worse than that model on random questions.
1 comment

I’m telling you, from looking at this closely, that there is substantial evidence, solely from new, never-before-seen prompts, that GPT-4 is by far the best and ChatGPT/Claude are second, with other Anthropic models, Vicuna, etc. bringing up the rear.