by actinium226 340 days ago
Very nice article. The point about mathematical reliability is interesting. I generally agree with it, but humans aren't 100% reliable, or even 99% reliable, so how do we manage to create things like the Linux kernel or the Mars landers without AI? Clearly we have some sort of goal-based self-correction mechanism. I wonder if there's research into AI on that thread?
5 comments

> Clearly we have some sort of goal-based self-correction mechanism.

Humans can try things, learn, and iterate. LLMs still can't really do the second thing: you can feed an error message back into the prompt, but the learning isn't added to the model's weights, so its knowledge doesn't compound with experience like ours does.

I think there are still a few theoretical breakthroughs needed for LLMs to achieve AGI and one of them is "active learning" like this.

Additionally, LLMs still don’t truly understand anything, which is why they flounder so badly with e.g. writing code for a programming language or framework that they haven’t seen enough training data for. Humans, on the other hand, do understand and generalize shared knowledge well, which is why we’re much better at handling that type of scenario.

More specific to agents, humans can also figure out how to use tools on the fly (even in the absence of documentation) where LLMs need human-built MCPs. This is also a significant limiting factor.

I’ve found Claude to be very helpful for both writing and debugging code in a language I’m currently building. I just make sure to load the spec into its context first and that seems to be enough for it to get a general understanding.
Everyone criticizes AI for not "understanding" anything... yet, as you found, and as many others have shown before, explain something to them and they bloody well look like they do understand it. I am still in awe at what LLMs can do, TBH. Over the last few months, their main problem, confidently making shit up, seems to be getting much less of a problem... it's still not solved, but if things keep improving I wouldn't be surprised if they get controls that ensure they stop doing that, and when that happens people will be able to trust what they say/write much more... and perhaps that will be a turning point, after which complaints like the ones in this post will be hard to take seriously.
The issue is that their “understanding” following an explanation is quite shallow. They often miss many connections and underlying principles that a human would grasp right away, needing to be spoon-fed these things to fill the gap.

That’s not to say they’re not useful in their current state. They are. However, I believe it’s becoming clear that there’s a hard ceiling to how capable LLMs in their current form can become and it’s going to take something radically different to break through.

> controls that ensure they stop doing that

I'm not sure you can do that. As humans, we need to make things up in order to have theories to test. Like back in the day before Einstein when people thought that light traveled through an "aether" whose properties we needed to figure out how to measure, or today when we can't explain the mass imbalance of the universe so we create this concept called "dark matter."

Also, in my experience the problem has been getting worse, or at least not better. I asked Claude 3.7 some time ago how to restore a snapshot to an active database on AWS, and it cheerfully told me to go to the console and press the button. Except there is no button, because AWS docs specifically say you can't restore a snapshot to an active database.

Compounding with learning and iterating, humans also build abstractions that significantly reduce the number of steps required: more expressive programming languages, compilers, and toolchains. We also build engines, libraries, and DSLs, and invent appropriate data structures to simplify the landscape or reuse existing work. Besides abstractions, we build tools like better type systems, error testing, and borrow checkers to help eliminate certain classes of errors. Finally, after all is said and done, we still have QA teams and major bugs.
100% and it seems like we need a whole new architecture to get there, because right now training a model takes so much time.

At the risk of making a terrible analogy, right now we're able to "give birth" to these machines after months of training, but once they're born, they can't really learn. Animals, by contrast, learn something new every day, go to sleep, clean up their memories a bit (deleting some, solidifying others), and wake up with an improved understanding of the world.

Maybe you're on to something. We need AI lions which will eat the models which don't learn or adapt enough.
I love the idea of AI lions, but you still need to find a way to allow models to continue "learning" after they're born—which is PhD worthy.

Right now we train AI babies, dump them in the wild... and expect them to have all the answers.

You could instruct the LLM to formulate a “lesson” based on the error and add this to the tool instructions for future runs.
This isn’t practical at scale. You’ll run into too many novel lessons and burn through too many tokens setting up context.
At scale you need to use more tricks. For example, only inject examples if the tool is going to be needed. Or amass lessons, then ask the LLM to summarize them to prune redundant information before it is used in the context.
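To make the "lessons" idea from the last few comments concrete, here is a minimal Python sketch, not anyone's actual agent code. The names `LessonStore`, `add`, `context_for`, and `dedupe` are all hypothetical, and `dedupe` is only a stand-in for the "ask the LLM to summarize" step, which is not implemented here.

```python
from collections import defaultdict
from typing import Callable


def dedupe(lessons: list[str]) -> list[str]:
    """Stand-in for an LLM summarization pass (assumption):
    it only drops case-insensitive duplicates, keeping first occurrences."""
    seen, kept = set(), []
    for lesson in lessons:
        key = lesson.strip().lower()
        if key not in seen:
            seen.add(key)
            kept.append(lesson.strip())
    return kept


class LessonStore:
    """Hypothetical per-tool store of lessons learned from past errors."""

    def __init__(self,
                 prune: Callable[[list[str]], list[str]] = dedupe,
                 max_per_tool: int = 5):
        self._lessons: dict[str, list[str]] = defaultdict(list)
        self._prune = prune
        self._max = max_per_tool

    def add(self, tool: str, lesson: str) -> None:
        """Record a lesson, then prune and cap so the context stays small."""
        self._lessons[tool].append(lesson)
        self._lessons[tool] = self._prune(self._lessons[tool])[-self._max:]

    def context_for(self, tools_needed: list[str]) -> str:
        """Return lessons only for the tools this run will actually use."""
        lines = []
        for tool in tools_needed:
            for lesson in self._lessons.get(tool, []):
                lines.append(f"[{tool}] {lesson}")
        return "\n".join(lines)


store = LessonStore()
store.add("sql", "Quote identifiers")
store.add("sql", "quote identifiers ")   # near-duplicate, pruned
store.add("git", "Fetch before rebase")
print(store.context_for(["sql"]))        # git lessons are not injected
```

The cap and the selective injection are what keep token use bounded; whether the pruned lessons remain useful depends entirely on the quality of the summarization step, which this sketch does not address.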
Humans build theories of how things work. LLMs don't. Theories are deterministic and symbolic. Take the Turing machine, for example, as a theory of computation in general, Euclidean geometry as a theory of space, and Newtonian mechanics as a theory of motion.

Even for software applications like the Linux kernel, there would have been a theory in Linus' head - for example of what an operating system is, and how it should work.

A theory gives 100% correct predictions, although the theory itself may not model the world accurately. That feedback between the theory and its application in the world drives iterations of the theory: from Newtonian mechanics to relativity, from Euclidean geometry to the geometry of curved spaces, etc.

Long story short, the LLM is a long way away from any of this. And to be fair to LLMs, the average human is not creating theories; it takes some genius to create them (Newton, Turing, etc.). The average human is trading memes on social media.

I believe there was an article/paper in the last few months about that exact issue

Someone was saying that with an increasing number of attempts, or increasing context length, LLMs are less and less likely to solve a problem

(I searched for it but can't find it)

That matches my experience -- the corrections in long context can just as easily be anti-corrections, e.g. turning something that works into something that doesn't work

---

Actually it might have been this one, but there are probably multiple sources saying the same thing, because it's true:

Context Rot: How Increasing Input Tokens Impacts LLM Performance - https://news.ycombinator.com/item?id=44564248

In this report, we evaluate 18 LLMs, including the state-of-the-art GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 models. Our results reveal that models do not use their context uniformly; instead, their performance grows increasingly unreliable as input length grows.

---

As for this question: how do we manage to create things like the Linux kernel or the Mars landers without AI

It's because human intelligence is a totally different thing than LLMs (contrary to what interested people will tell you)

Carmack said there are at least 5 or 6 big breakthroughs left before "AGI", and I think even that is a misleading framing. It's certainly possible that "AGI" will not be reached - there could be hardware bottlenecks, software/algorithmic questions, or other obstacles we haven't thought of

That is, I would not expect AI to create anything like the Linux kernel. The burden of proof is on the people who claim that, not the other way around!!!

I saw your edit with the paper, but when you first mentioned it I thought you might have been referring to the Apple paper that more or less said the same thing.

Speaking of Apple, I just want to get it out there that I'm impressed that they're exhibiting self restraint in this AI era. I know they get bashed for not being "up to speed" with "the rest of the industry," but I believe they're doing this on purpose because they see what slop it is and they'd prefer to scope it down to release something more useful.

Humans aren't 100% reliable but we can build tools that are 100% reliable to verify our predictions.
We don't generate chains of tokens with a constant error rate so errors don't pile up. Don't ask me what we do instead for I have no clue but whatever it is, it works better than next token prediction.
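For what it's worth, the "errors pile up" arithmetic is easy to sketch in Python. This is a toy model under loud assumptions: every step fails independently with the same probability, and the verifier in the second function is perfect. Neither assumption is claimed to hold for real LLMs or real humans, and both function names are made up for illustration.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is right, assuming
    independent steps with a constant accuracy (toy assumption)."""
    return per_step_accuracy ** steps


def chain_success_with_checks(per_step_accuracy: float,
                              steps: int,
                              retries: int) -> float:
    """Same chain, but each step is checked by a perfect verifier
    (assumption) and retried up to `retries` extra times on failure."""
    step_ok = 1 - (1 - per_step_accuracy) ** (retries + 1)
    return step_ok ** steps


# 99% per-step accuracy over 100 unchecked steps: roughly a 37% chance
# that the whole chain comes out right.
print(round(chain_success(0.99, 100), 3))

# The same chain with a verifier and two retries per step.
print(round(chain_success_with_checks(0.99, 100, retries=2), 4))
```

In this model the gap between the two numbers comes entirely from checking each step before building on it, which is the role the reliable tools mentioned above would play.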

Hey, maybe humans aren't just like LLMs after all.