Hacker News
by wizzwizz4 78 days ago
From the article:

> There's a common rebuttal to this, and I hear it constantly. "Just wait," people say. "In a few months, in a year, the models will be better. They won't hallucinate. They won't fake plots. The problems you're describing are temporary." I've been hearing "just wait" since 2023.

We're not trending towards superintelligence with these AIs. We're trending towards (and, in fact, have already reached) superintelligence with computers in general, but LLM agents are among the least capable known algorithms for the majority of tasks we get them to do. The problem, as it usually is, is that most people don't have access to the fruits of obscure research projects.

Untrained children write better code than the most sophisticated LLMs, without even noticing they're doing anything special.

2 comments

> Untrained children write better code than the most sophisticated LLMs, without even noticing they're doing anything special.

I’ll take that bet. How much money would you like to put on this, and we’ll have a neutral third party pick both the untrained child and the LLM.

Let me know.

I'm willing to bet 10% of my net worth on this. But my claim was not about any given untrained child (for instance, a child who does not want to program would do poorly): a fair bet would allow me to choose the child, you to choose the LLM, use a task and programming language of the child's choice, and have a neutral third party familiar with the programming language judge "better code". (I would, of course, want to ensure that the judge used an appropriate rubric: RLHF can produce a sophisticated turd-polisher. Perhaps the evaluation process could involve modifications made to the program?)

It is (rightly) difficult to get hold of one uninvolved child, for safeguarding reasons, so it would be better to run it as a school (or interschool) competition, where multiple children may participate. For fairness, you may also provide multiple LLM participants (however you define that). The winner of the contest, as determined by the judge, would then determine the winner of the bet, unless the winning child had been trained, in which case we would fall back to the next-highest-ranked participant. The number of LLM candidates would be equal to the number of eligible children.

However, I don't see a good way to allow each child to pick a programming language and task, without leaving the competition results incomparable. So perhaps each child should be paired with an LLM, and the judge should determine which submission from each pair is better? But then if I only need one victory (to support my claim), this is clearly unfair. So each pair should be tested enough to determine whether they're consistently better than the LLM… but then we are demanding a lot of the child participants, for no real benefit to them.

If we can agree on a workable protocol, I can try to pull some strings and see if we can make this happen. I could use the money.

The rate of hallucination has gone down drastically since 2023. As LLM coding tools continue to pare that rate down, eventually we’ll hit a point where it is comparable to the rate at which we naturally introduce bugs as human programmers.

I wonder how much of the decrease in hallucination is because the models are getting better, and how much is because these massively over-funded companies are adding a bunch of one-off shims at breakneck speed. I.e., are they truly improving the cognition, or just monkey-patching the hell out of it?

The recent article where the AI companies are paying experts in the field to help train the models makes me wonder if they're also manually fixing a bunch of post-processing errors as they come up.

LLMs are still making fundamentally the same kinds of errors that they made in 2021. If you check my HN comment history, you'll see I predicted these errors, just from skimming the relevant academic papers (which is to say they're obvious: I'm far from the only person saying this). There is no theoretical reason we should expect them to go away, unless the model architectures fundamentally change (and no, GPT -> LLaMA is not a fundamental change), because they're not removable discontinuities: they're indicative of fundamental capability gaps.

I don't care how many terms you add to your Taylor series: your polynomial approximation of a sine wave is never going to be suitable for additive speech synthesis. Likewise, I don't care how good your predictive-text transformer model gets at instrumental NLP subtasks: it will never be a good programmer (except as far as it's a plagiarist). Just look at the Claude Code source code: if anyone's an expert in agentic AI development, it's the Claude people, and yet the codebase is utterly unmaintainable dogshit that shouldn't work and, on further inspection, doesn't work.
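
For anyone who wants to poke at the Taylor series analogy numerically, here is a minimal sketch. It assumes the Maclaurin expansion of sine around 0, and the helper names `taylor_sin` and `max_error` are made up for illustration. It only shows that a polynomial with a fixed number of terms tracks sin(x) closely near the expansion point and then grows without bound further out, while sin(x) stays within [-1, 1]. It says nothing either way about LLMs.

```python
import math

def taylor_sin(x: float, terms: int) -> float:
    """Maclaurin polynomial of sin(x), truncated to `terms` nonzero terms."""
    total = 0.0
    for k in range(terms):
        total += (-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)
    return total

def max_error(terms: int, x_max: float, samples: int = 1000) -> float:
    """Largest |taylor_sin(x) - sin(x)| over an even grid on [0, x_max]."""
    worst = 0.0
    for i in range(samples + 1):
        x = x_max * i / samples
        worst = max(worst, abs(taylor_sin(x, terms) - math.sin(x)))
    return worst

if __name__ == "__main__":
    # Compare accuracy over half a period with accuracy over a long stretch.
    for terms in (5, 10, 20):
        near = max_error(terms, math.pi)
        far = max_error(terms, 100.0)
        print(f"terms={terms:2d}  max error on [0, pi]: {near:.3e}  on [0, 100]: {far:.3e}")
```

For scale, one second of a 440 Hz tone sweeps the phase through roughly 2765 radians, so the "far" range here is still tiny compared with what an audio oscillator would need without wrapping the phase first.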

That's not to say that no computer program can write computer programs, but this computer program is well into the realm of diminishing returns.