| HN Mirror


by MallocVoidstar 216 days ago

I don't think there are any up-to-date leaderboards, but models absolutely degrade in performance the more context they're dealing with.

https://wandb.ai/byyoung3/ruler_eval/reports/How-to-evaluate...

>Gpt-5-mini records 0.87 overall judge accuracy at 4k [context] and falls to 0.59 at 128k.

And Llama 4 Scout claimed a 10 million token context window but in practice its performance on query tasks drops below 20% accuracy by 32k tokens.


mg 216 days ago

That makes me wonder if we could simply test this by letting the LLM add or multiply a long list of numbers?

Here is an experiment:

https://www.gnod.com/search/#q=%23%20Calcuate%20the%20below%...

The correct answer:

    Correct:    20,192,642.460942328

Here is what I got from different models on the first try:

    ChatGPT:    20,384,918.24
    Perplexity: 20,000,000
    Google:     25,167,098.4
    Mistral:    200,000,000
    Grok:       Timed out after 300s of thinking
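To put numbers on how far off each first-try answer was, here is a minimal sketch that ranks them by relative error. The values are copied from the lists above; the helper names `relative_error` and `rank_by_error` are made up for illustration, and Grok is left out because it returned no answer.

```python
from decimal import Decimal

# Reference product and first-try answers, copied from the comment above.
CORRECT = Decimal("20192642.460942328")
ANSWERS = {
    "ChatGPT": Decimal("20384918.24"),
    "Perplexity": Decimal("20000000"),
    "Google": Decimal("25167098.4"),
    "Mistral": Decimal("200000000"),
}


def relative_error(answer, correct=CORRECT):
    """Return |answer - correct| / |correct| as a Decimal."""
    return abs(answer - correct) / abs(correct)


def rank_by_error(answers, correct=CORRECT):
    """Return (model, relative error) pairs sorted from closest to furthest."""
    scored = [(name, relative_error(value, correct)) for name, value in answers.items()]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    for name, err in rank_by_error(ANSWERS):
        # Print each model's relative error as a percentage.
        print(f"{name:<11} {float(err):.2%}")
```

Relative error is only one possible score; counting matching leading digits would be stricter and would separate a near miss from a rounded guess more clearly.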


gcanyon 216 days ago

> Do not use a calculator. Do it in your head.

You wouldn't ask a human to do that, why would you ask an LLM to? I guess it's a way to test them, but it feels like the world record for backwards running: interesting, maybe, but not a good way to measure, like, anything about the individual involved.


throwuxiytayq 216 days ago

I’m starting to find it unreasonably funny how people always want language models to multiply numbers for some reason. Every god damn time. In every single HN thread. I think my sanity might be giving out.


solatic 216 days ago

A model, no, but an agent with a calculator tool?

Then there's the question of why not just build the calculator tool into the model?
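For what a "calculator tool" might look like on the agent side, here is a rough sketch. It assumes the agent framework can call a plain function with an expression string and feed the returned string back to the model; `calculator_tool` is a hypothetical name, not any particular framework's API. It only accepts numbers, `+ - * /`, and parentheses, and does the arithmetic in `Decimal`.

```python
import ast
import operator
from decimal import Decimal

# Whitelisted operators; anything else in the expression is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculator_tool(expression: str) -> str:
    """Evaluate a basic arithmetic expression and return the result as text."""

    def evaluate(node):
        if isinstance(node, ast.Constant):
            value = node.value
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                return Decimal(str(value))
            raise ValueError("only numeric literals are allowed")
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](evaluate(node.operand))
        raise ValueError("unsupported expression")

    tree = ast.parse(expression, mode="eval")
    return str(evaluate(tree.body))


if __name__ == "__main__":
    print(calculator_tool("2 * 3.5 + 1"))
```

One caveat: literals pass through Python's float parser before becoming `Decimal`, so inputs with more than about 17 significant digits would lose precision. A real tool would want to parse the digits directly.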


KristoAI 216 days ago

Since Grok 4 Fast got this answer correct so quickly, I decided to test more.

Tested this on the new hidden model of ChatGPT called Polaris Alpha: Answer: $20,192,642.460942336$

Current gpt-5 medium reasoning says: After confirming my calculations, the final product (P) should be (20,192,642.460942336)

Claude Sonnet 4.5 says: “29,596,175.95 or roughly 29.6 million”

Claude Haiku 4.5 says: ≈20,185,903

GLM 4.6 says: 20,171,523.725593136

I’m going to try out Grok 4 Fast on some coding tasks at this point to see if it can create functions properly. Design help is still best on GPT-5 at this exact moment.


jarek83 216 days ago

Isn't it the case that LLMs are not designed to do calculations?


cluckindan 216 days ago

They are not LMMs, after all…

Neither are humans.

But humans can still do it.
