HN Mirror


by prmph 218 days ago

Nothing will really work when the models fail at the most basic of reasoning challenges.

I've had models do the complete opposite of what I've put in the plan and guidelines. I've had them go re-read the exact sentences, and still see them come to the opposite conclusion, and my instructions are nothing complex at all.

I used to think one could build a workflow and process around LLMs that extract good value from them consistently, but I'm now not so sure.

I notice that sometimes the model will be in a good state, and do a long chain of edits of good quality. The problem is, it's still a crapshoot how to get them into a good state.

3 comments

hu3 218 days ago

Check context size.

LLMs become increasingly error-prone as their memory fills up. Just like humans.

In VSCode Copilot you can keep track of how many tokens the LLM is dealing with in realtime with "Chat Debug".

When it reaches 90k tokens, I expect degraded intelligence and brace for a possible forced summarization.

Sometimes I just stop LLMs and continue the work in a new session.
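
A minimal sketch of the check described above, assuming a crude 4-characters-per-token estimate (not a real tokenizer) and using the 90k figure from this comment as the threshold. The names `estimate_tokens` and `should_start_new_session` are hypothetical and not part of Copilot or any other tool.

```python
# Rough context-budget check. The 4 chars/token ratio is an assumption,
# not a real tokenizer; the 90k limit is the figure quoted in the comment.
CHARS_PER_TOKEN = 4
TOKEN_LIMIT = 90_000


def estimate_tokens(text: str) -> int:
    """Crude token estimate from character count (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def should_start_new_session(messages: list[str], limit: int = TOKEN_LIMIT) -> bool:
    """Return True once the estimated conversation size reaches the limit."""
    total = sum(estimate_tokens(m) for m in messages)
    return total >= limit
```

In practice the token count shown by the tool itself is the better signal; an estimate like this is only useful when no such count is available.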


mstank 218 days ago

In my experience this was an issue 6-8 months ago. Ever since Sonnet 4 I haven’t had any issues with instruction following.

Biggest step-change has been being able to one-shot file refactors (using the planning framework I mentioned above). 6 months ago refactoring was a very delicate dance and now it feels like it’s pretty much streamlined.


ewoodrich 218 days ago

I recently ran into two baffling, completely backwards misinterpretations of an unambiguous sentence that felt like the GPT 3.5 era, once each in Codex and CC/Sonnet, a few days apart and in completely different scenarios (both very early in the context window). To be fair, they were notable partly as an "exception that proves the rule": it was surprising to see, but OP's example can definitely still happen in my experience.

I was prepared to go back to my original message and spot an obvious-in-hindsight grey area/phrasing issue on my part as the root cause but there was nothing in the request itself that was unclear or problematic, nor was it buried deep within a laundry list of individual requests in a single message. Of course, the CLI agents did all sorts of scanning through the codebase/self debate/etc in between the request and the first code output. I'm used to how modern models/agents get tripped up by now so this was an unusually clear cut failure to encounter from the latest large commercial reasoning models.

In both instances, literally just restating the exact same request with "No, the request was: [original wording]" was all it took to steer them back, and it didn't become a concerning pattern. But given the unpredictability of how the CLI agents decide to traverse a repo and ingest large amounts of distracting code/docs, it seems much too overconfident to believe that random, bizarre LLM "reasoning" failures won't still occur from time to time in regular usage, even as models improve, given their inherent limitations.

(If I were bending over backwards to be charitable/anthropomorphize, it would be the human failure mode of "I understood exactly what I was asked for and what I needed to do, but then somehow did the exact opposite, haha oops brain fart!" but personally I'm not willing to extend that much forgiveness/tolerance to a failure from a commercial tool I pay for...)


PeterFBell 218 days ago

It's complicated. Firstly, don't love that this happens. But the fact you're not willing to provide tolerance to a commercial tool that costs maybe a few hundred bucks a month but are willing to do so for a human who probably costs thousands of bucks a month is revealing of a double standard we're all navigating.

It's like the fallout when a Waymo kills a "beloved neighborhood cat". I'm not against cats, and I'm deeply saddened at the loss of any life, but if it's true that (comparable) mile for mile, Waymos reduce deaths and injuries, that is a good thing - even if they don't reduce them to zero.

And to be clear, I often feel the same way - but I am wondering why and whether it's appropriate!


prmph 218 days ago

For me I was just pointing out some interesting and noteworthy failure modes.

And it matters. If the models struggle sometimes with basic instruction following, they can quite possibly make insidious mistakes in large complex tasks that you might not have the wherewithal or time to review.

The thing about good abstractions is that you should be able to trust in a composable way. The simpler or more low-level the building blocks, the more reliable you should expect them to be. In LLMs you can't really make this assumption.


ewoodrich 217 days ago

I mean, we typically architect systems depending on humans around an assumption of human fallibility. But when it comes to automation, randomly still doing the exact opposite even if somewhat rare is problematic and limits where and at what scale it can be safely deployed without needing ongoing human supervision.

For a coding tool it’s not as problematic, as hopefully you vet the output to some degree, but it still means I don’t feel comfortable using them as expansively (like the mythical personal assistant doing my banking and replying to emails, etc) as they might otherwise be used with more predictable failure modes.

I’m perfectly comfortable with Waymo on the other hand, but that would probably change if I knew they were driven by even the newest and fanciest LLMs as [toddler identified | action: avoid toddler] -> turns towards toddler is a fundamentally different sort of problem.


alienbaby 218 days ago

I'm curious in what kind of situations you are seeing the model consistently do the opposite of your intention where the instructions were not complex. Do you have any examples?


avereveard 218 days ago

Mostly Gemini 3 Pro. When I ask it to investigate a bug and provide fix options (I do this mostly so I can see whether the model loaded the right context for large tasks), Gemini immediately starts fixing things, and I just can't trust it.

Codex and Claude give a nice report, and if I see they're not considering this or that I can tell 'em.


saxenaabhi 218 days ago

fyi that happened to me with codex.

But why is it a big issue? If it does something bad, just reset the worktree and try again with a different model/agent. They are dirt cheap at 20/month, and I have 4 subscriptions (Claude, Codex, Cursor, Zed).
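
A minimal sketch of the reset-and-retry loop described above. It assumes each agent is wrapped in a hypothetical callable that edits the worktree and returns `True` when its result is acceptable (for example, when tests pass); how the real agent CLIs are invoked is deliberately left out. `git reset --hard` and `git clean -fd` are destructive, so this only makes sense in a throwaway worktree.

```python
import subprocess
from typing import Callable, Optional, Sequence


def reset_worktree(repo_dir: str) -> None:
    """Discard tracked changes and untracked files in repo_dir (destructive)."""
    subprocess.run(["git", "reset", "--hard"], cwd=repo_dir, check=True)
    subprocess.run(["git", "clean", "-fd"], cwd=repo_dir, check=True)


def try_agents(
    agents: Sequence[Callable[[], bool]],
    reset: Callable[[], None],
) -> Optional[int]:
    """Run each agent in turn; reset after a failure and move to the next.

    Each agent is a hypothetical callable that edits the worktree and returns
    True if its result is acceptable. Returns the index of the first agent
    that succeeds, or None if all of them fail.
    """
    for index, agent in enumerate(agents):
        if agent():
            return index
        reset()  # throw away the bad attempt before trying the next agent
    return None
```

A call might look like `try_agents([run_codex, run_claude], lambda: reset_worktree("."))`, where `run_codex` and `run_claude` are placeholders for whatever wrappers you write around each tool.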


avereveard 215 days ago

Same, I have multiple subscriptions and layer them. I use Haiku to plan and send a queue of tasks to Codex and Gemini, whose command lines can be scripted.

The issue for me is that I have no idea what the code looks like, so I need a reliable first-layer model that can summarize the current codebase state so I can decide whether the next mutation moves the project forward or reduces technical debt. I can delegate much more that way, while Gemini's "do first" approach tends to result in many dead ends that I have to unravel.
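
The layered setup above could be sketched roughly as follows, assuming the planner's output is already a list of `(worker_name, task)` pairs. The `workers` callables are hypothetical wrappers around scripted CLIs, and `approve` stands in for the review step (a human, or a summarizing model) that decides whether a result moves the project forward; none of these names come from a real tool.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


def run_queue(
    tasks: List[Tuple[str, str]],
    workers: Dict[str, Callable[[str], str]],
    approve: Callable[[str], bool],
) -> List[str]:
    """Dispatch (worker_name, task) pairs in order, gated by a review step.

    workers maps a name to a hypothetical callable wrapping a scripted CLI.
    approve decides whether a result is acceptable. Processing stops at the
    first rejection so later tasks do not build on a dead end.
    """
    queue: Deque[Tuple[str, str]] = deque(tasks)
    accepted: List[str] = []
    while queue:
        name, task = queue.popleft()
        result = workers[name](task)
        if not approve(result):
            break  # stop before piling more changes on a rejected result
        accepted.append(result)
    return accepted
```

Stopping at the first rejection is one possible design choice; re-planning or retrying with a different worker would be equally reasonable.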


prmph 218 days ago

The issue is that if it's struggling sometimes with basic instruction following, it's likely to be making insidious mistakes in large complex tasks that you might not have the wherewithal or time to review.


saxenaabhi 218 days ago

I'm not sure you can make that assumption even when a human wrote that code. LLMs are competing with humans, not with some abstraction.

> The issue is that if it's struggling sometimes with basic instruction following, it's likely to be making insidious mistakes in large complex tasks that you might no have the wherewithal or time to review.

Yes, that's why we review all code even when written by humans.
