by dial481
87 days ago
We audited the LoCoMo benchmark (one of the most-cited evals for LLM agent memory) and found 99 score-corrupting errors in 1,540 questions (6.4%). Separately, we tested the LLM judge with adversarially generated wrong answers: it accepted 62.81% of vague-but-topical wrong answers. Some published system scores barely clear that bar.
The full audit includes the methodology, documentation of all 99 errors, and reproducible scripts.
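As a quick sanity check on the figures above, here is a minimal Python sketch that recomputes the error rate and compares a reported score against the judge's false-accept rate. The helper names and the example score of 65.0 are hypothetical and not taken from the audit. Treating the 62.81% acceptance rate as a rough floor for scores is also an assumption made for illustration.

```python
# Figures quoted in the post.
ERRORS = 99
QUESTIONS = 1540
# Share of vague-but-topical wrong answers the LLM judge accepted.
JUDGE_FALSE_ACCEPT_PCT = 62.81


def error_rate_pct(errors: int, total: int) -> float:
    """Percentage of benchmark questions with a score-corrupting error."""
    return 100.0 * errors / total


def margin_over_floor(score_pct: float,
                      floor_pct: float = JUDGE_FALSE_ACCEPT_PCT) -> float:
    """Hypothetical helper: points by which a reported score exceeds
    the judge's false-accept rate (assumed here to act as a rough floor)."""
    return score_pct - floor_pct


if __name__ == "__main__":
    print(f"error rate: {error_rate_pct(ERRORS, QUESTIONS):.1f}%")  # error rate: 6.4%
    # 65.0 is a made-up score, used only to show the comparison.
    print(f"margin: {margin_over_floor(65.0):.2f} points")
```

A small margin here would mean the judge cannot reliably tell that system apart from one that only gives vague, on-topic wrong answers.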