HN Mirror

	by dial481 87 days ago
	We audited the LoCoMo benchmark (one of the most cited evals for LLM agent memory) and found 99 score-corrupting errors in 1,540 questions (6.4%). Separately, we tested the LLM judge with adversarially generated wrong answers; it accepted 62.81% of vague-but-topical wrong answers. Some published system scores barely clear that bar. The full audit includes the methodology, documentation of all 99 errors, and reproducible scripts.
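For readers who want to check the arithmetic, here is a minimal sketch that uses only the figures quoted above. The function names are hypothetical and are not taken from the audit's scripts. The "ceiling" also rests on an assumption the post does not state: that a fully correct system would be marked wrong on every corrupted question.

```python
# Figures quoted in the post; everything else here is illustrative.
TOTAL_QUESTIONS = 1540
CORRUPTED_QUESTIONS = 99
JUDGE_FALSE_ACCEPT_RATE = 0.6281  # share of vague-but-topical wrong answers accepted


def error_rate(corrupted: int, total: int) -> float:
    """Fraction of benchmark questions with a score-corrupting error."""
    return corrupted / total


def score_ceiling(corrupted: int, total: int) -> float:
    """Best possible score, ASSUMING every corrupted question is scored as wrong."""
    return 1.0 - error_rate(corrupted, total)


def margin_over_judge_floor(reported_score: float) -> float:
    """How far a reported score (0..1) sits above the judge's false-accept rate.

    ASSUMPTION: the 62.81% acceptance rate is treated as a rough floor that a
    system producing only vague-but-topical wrong answers could reach.
    """
    return reported_score - JUDGE_FALSE_ACCEPT_RATE


if __name__ == "__main__":
    rate = error_rate(CORRUPTED_QUESTIONS, TOTAL_QUESTIONS)
    ceiling = score_ceiling(CORRUPTED_QUESTIONS, TOTAL_QUESTIONS)
    print(f"error rate: {rate:.2%}")
    print(f"score ceiling under the stated assumption: {ceiling:.2%}")
```

The error rate rounds to the 6.4% quoted in the post. The ceiling and the margin are only as good as the assumptions named in the comments.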

1 comment

PaulHoule 87 days ago

I've worked in IR, and this has been true of TREC data sets from the beginning; it has also been true of visual data sets. The first step in building a world-beating commercial system has been to clean up the garbage in open evals to raise the achievable accuracy ceiling.

dial481 86 days ago

That's encouraging to hear from someone with IR experience, thanks. Agree completely.
