by MallocVoidstar
216 days ago
I don't think there are any up-to-date leaderboards, but models absolutely degrade in performance the more context they're dealing with.

https://wandb.ai/byyoung3/ruler_eval/reports/How-to-evaluate...

>Gpt-5-mini records 0.87 overall judge accuracy at 4k [context] and falls to 0.59 at 128k. And Llama 4 Scout claimed a 10 million token context window but in practice its performance on query tasks drops below 20% accuracy by 32k tokens.
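
To put the quoted GPT-5-mini numbers in perspective, here is a quick sketch of the relative loss between the 4k and 128k settings. The helper name `relative_drop` is made up for illustration and is not taken from the linked report; only the two accuracy figures come from the quote.

```python
def relative_drop(baseline: float, degraded: float) -> float:
    """Fraction of the baseline score that is lost (0.0 = no loss, 1.0 = total loss)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - degraded) / baseline


# Accuracy figures quoted above: 0.87 at 4k context, 0.59 at 128k context.
drop = relative_drop(0.87, 0.59)
print(f"relative drop: {drop:.1%}")
```

That works out to roughly a third of the short-context accuracy being lost by 128k.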
Here is an experiment:
https://www.gnod.com/search/#q=%23%20Calcuate%20the%20below%...
The correct answer:
Here is what I got from different models on the first try: