| HN Mirror



	by woeirua 244 days ago
	There's a huge disconnect between what the benchmarks show and what those of us using LLMs experience day to day. According to SWE-bench, I should be able to outsource a lot of tasks to LLMs by now. But practically speaking, I can't get them to reliably do even the most basic tasks. Benchmaxxing is a real phenomenon. Internal private assessments are the most accurate source of information we have, and those seem to be quite mixed for the most recent models.

1 comments

jzymbaluk 244 days ago

How ironic that these LLMs appear to be overfitting to the benchmark scores. Presumably these researchers deal with overfitting every day, but can't recognize it right in front of them.


woeirua 244 days ago

I'm sure they all know it's happening. But the incentives are all misaligned. They get promotions and raises for pushing the frontier, which means showing SOTA performance on benchmarks.
