HN Mirror


by mg 221 days ago

If a model is not making use of the whole context window - shouldn't that be very noticeable when the prompt is code?

For example when querying a model to refactor a piece of code - would that really work if it forgets about one part of the code while it refactors another part?

I concatenate a lot of code files into a single prompt multiple times a day and ask LLMs to refactor them, implement features or review the code.
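
As a rough illustration of that workflow, here is a minimal sketch of joining several files into one prompt. The function name `build_prompt` and the `### File:` / `### Task` header format are assumptions made for this example, not something described in the post.

```python
from pathlib import Path


def build_prompt(paths, instruction):
    """Join source files into one prompt, each preceded by a header naming the file."""
    parts = []
    for path in map(Path, paths):
        # The header lets the model tell the files apart (header format is an assumption).
        parts.append(f"### File: {path}\n{path.read_text(encoding='utf-8')}")
    # The task goes last, after all of the code.
    parts.append(f"### Task\n{instruction}")
    return "\n\n".join(parts)
```

Calling `build_prompt(["a.py", "b.py"], "Refactor the date handling")` would return one string that can be pasted into a chat window; the file names here are placeholders.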

So far, I have never had the impression that filling the context window with a lot of code causes problems.

I also use very long lists of instructions on code style on top of my prompts. And the LLMs seem to be able to follow all of them just fine.

1 comments

MallocVoidstar 221 days ago

I don't think there are any up-to-date leaderboards, but models absolutely degrade in performance the more context they're dealing with.

https://wandb.ai/byyoung3/ruler_eval/reports/How-to-evaluate...

>Gpt-5-mini records 0.87 overall judge accuracy at 4k [context] and falls to 0.59 at 128k.

And Llama 4 Scout claimed a 10 million token context window but in practice its performance on query tasks drops below 20% accuracy by 32k tokens.


mg 221 days ago

That makes me wonder if we could simply test this by letting the LLM add or multiply a long list of numbers?

Here is an experiment:

https://www.gnod.com/search/#q=%23%20Calcuate%20the%20below%...

The correct answer:

    Correct:    20,192,642.460942328

Here is what I got from different models on the first try:

    ChatGPT:    20,384,918.24
    Perplexity: 20,000,000
    Google:     25,167,098.4
    Mistral:    200,000,000
    Grok:       Timed out after 300s of thinking
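
One way to put those answers on a common scale is relative error against the reference value. The sketch below uses only the numbers listed above; the helper name `relative_error` is made up for this example, and Grok is left out because it returned no answer.

```python
from decimal import Decimal

REFERENCE = Decimal("20192642.460942328")

# Answers as reported in the comment above (Grok timed out, so it is omitted).
REPORTED = {
    "ChatGPT": Decimal("20384918.24"),
    "Perplexity": Decimal("20000000"),
    "Google": Decimal("25167098.4"),
    "Mistral": Decimal("200000000"),
}


def relative_error(answer, reference=REFERENCE):
    """Return |answer - reference| / reference as a Decimal."""
    return abs(answer - reference) / reference


# Print the models from closest to furthest from the reference.
for name, answer in sorted(REPORTED.items(), key=lambda kv: relative_error(kv[1])):
    print(f"{name:<11} {float(relative_error(answer)):.2%}")
```

A single first-try answer per model is a very small sample, so this says little about how any of them behaves in general.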


gcanyon 221 days ago

> Do not use a calculator. Do it in your head.

You wouldn't ask a human to do that, why would you ask an LLM to? I guess it's a way to test them, but it feels like the world record for backwards running: interesting, maybe, but not a good way to measure, like, anything about the individual involved.


throwuxiytayq 221 days ago

I’m starting to find it unreasonably funny how people always want language models to multiply numbers for some reason. Every god damn time. In every single HN thread. I think my sanity might be giving out.


solatic 221 days ago

A model, no, but an agent with a calculator tool?

Then there's the question of why not just build the calculator tool into the model?
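
For illustration, a calculator tool of that kind can be very small. The sketch below is hypothetical: `multiply_tool` is an invented name and is not tied to any real agent framework. The idea is that the model only passes the operands along as strings, and ordinary code does the arithmetic exactly.

```python
from fractions import Fraction
from functools import reduce
from operator import mul


def multiply_tool(numbers):
    """Hypothetical tool: multiply decimal strings exactly and return a Fraction."""
    return reduce(mul, (Fraction(n) for n in numbers), Fraction(1))


# The model would emit the operands; ordinary code does the arithmetic.
result = multiply_tool(["1.5", "2", "0.1"])
print(float(result))
```

Using `Fraction` rather than `float` avoids accumulating rounding error over a long list of factors, at the cost of producing large numerators and denominators.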


KristoAI 221 days ago

Since Grok 4 Fast got this answer correct so quickly, I decided to test more models.

Tested this on the new hidden model of ChatGPT called Polaris Alpha: Answer: 20,192,642.460942336

Current gpt-5 medium reasoning says: After confirming my calculations, the final product P should be 20,192,642.460942336

Claude Sonnet 4.5 says: “29,596,175.95 or roughly 29.6 million”

Claude Haiku 4.5 says: ≈20,185,903

GLM 4.6 says: 20,171,523.725593136
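
The two answers above ending in ...336 differ from the value posted as correct earlier in the thread (ending in ...328) only in the last digits. A quick sketch, assuming both values are read as 64-bit floats, measures that gap in units in the last place (ulp). `math.ulp` needs Python 3.9 or later. A gap of a few ulp would be consistent with ordinary floating-point rounding, though it does not show how either number was produced.

```python
import math

reference = float("20192642.460942328")  # value posted as correct earlier in the thread
reported = float("20192642.460942336")   # value reported for Polaris Alpha and gpt-5

# Spacing between adjacent 64-bit floats at this magnitude.
ulp = math.ulp(reference)

# How many of those steps separate the two values.
gap_in_ulps = abs(reported - reference) / ulp

print(f"ulp = {ulp:.3e}, gap = {gap_in_ulps:.0f} ulp")
```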

I’m going to try out Grok 4 fast on some coding tasks at this point to see if it can create functions properly. Design help is still best on GPT-5 at this exact moment.


jarek83 221 days ago

Isn't it the case that LLMs are not designed to do calculations?


cluckindan 221 days ago

They are not LMMs, after all…

Neither are humans.

But humans can still do it.
