| HN Mirror

by disgruntledphd2 312 days ago

> By now, the main reason people expect AI progress to halt is cope. People say "AI progress is going to stop, any minute now, just you wait" because the alternative makes them very, very uncomfortable.

OK, so where is the new data going to come from? Fundamentally, LLMs work by doing token prediction when some token(s) are masked. This process (which doesn't require supervision hence why it scaled) seems to be fundamental to LLM improvement. And basically all of the AI companies have slurped up all of the text (and presumably all of the videos) on the internet. Where does the next order of magnitude increase in data come from?
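As a minimal sketch of why this kind of training needs no supervision: the targets come straight out of the raw text. The function name `next_token_pairs`, the `context_size` parameter, and whitespace splitting as a stand-in for a real tokenizer are all assumptions made for illustration, not anything a particular lab is known to use.

```python
def next_token_pairs(tokens, context_size=3):
    """Turn a raw token sequence into (context, target) training pairs.

    No human labels are needed: each target is simply the token that
    follows its context window in the text itself.
    """
    pairs = []
    for i in range(1, len(tokens)):
        context = tuple(tokens[max(0, i - context_size):i])
        pairs.append((context, tokens[i]))
    return pairs


# Whitespace splitting is an assumption standing in for a real tokenizer.
tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

Every extra sentence of text yields more pairs for free, which is also why running out of fresh text would matter for this particular objective.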

More fundamentally, lots of the hype is about research/novel stuff which seems to me to be very, very difficult to get from a model that's trained to produce plausible text. Like, how does one expect to see improvements in biology (for example) based on text input and output?

Remember, these models don't appear to reason much like humans; they seem to do well where the training data is sufficient (interpolation) and do badly where there isn't enough data (extrapolation).

I'd love to understand how this is all supposed to change, but haven't really seen much useful evidence (i.e. papers and experiments) on this, just AI CEOs talking their book. Happy to be corrected if I'm wrong.

3 comments

insignificntape 312 days ago

That's not true. And trust me, dude, it scares the living ** out of me, so I wish you were right. Next-token prediction is the AI equivalent of a baby flailing its arms around and learning basic concepts about the world around it. The AI learns to mimic human behavior and recognize patterns, but it doesn't learn how to leverage this behavior to achieve goals. The pre-training is simply giving the AI a baseline understanding of the world. Everything that's going on now, getting it to think (i.e. talking to itself to solve more complex tasks), or getting it to do maths or coding, is simply us directing that inherent knowledge it's gathered from its pre-training and teaching the AI how to use it.

Look at Claude Code. Unless they hacked into private GitHub/GitLab repos (which, honestly, I wouldn't put past these tech CEOs; see what Cloudflare recently found out about Perplexity as an example), they trained Claude 4 on approximately the same data as Claude 3. Yet for some reason its agentic coding skills are stupidly enhanced compared to previous iterations.

Data no longer seems to be the bottleneck. Which is understandable. At the end of the day, data is really just a way to get the AI to make a prediction and run gradient descent on it. If you can generate, for example, a bunch of unit tests, you can let the AI freewheel its way into getting them to pass. A kid learns to catch a baseball not by seeing a million examples of people catching balls, but instead by testing their skills in the real world, and gathering feedback from the real world on whether their attempt to catch the ball was successful. If an AI can try to achieve goals and assess whether or not its actions lead to a successful or a failed attempt, who needs more data?
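The unit-test idea can be sketched roughly like this: score each attempt by the fraction of tests it passes, and use that score as the feedback signal. This is a toy under stated assumptions, not how any lab is known to train: `reward_from_tests`, `TESTS`, and `CANDIDATES` are made-up names, the hand-written candidates stand in for model samples, and there is no sandboxing.

```python
def reward_from_tests(candidate_source, tests, func_name="add"):
    """Return the fraction of tests a candidate passes (0.0 to 1.0)."""
    namespace = {}
    try:
        # NOTE: exec runs untrusted code with no sandbox; fine for a toy only.
        exec(candidate_source, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0  # does not even load: no reward
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test counts as a failure
    return passed / len(tests)


# Hypothetical test suite and hand-written stand-ins for model attempts.
TESTS = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
CANDIDATES = [
    "def add(a, b): return a - b",  # wrong logic
    "def add(a, b): return a + b",  # correct
    "def add(a, b) return a + b",   # syntax error
]

rewards = [reward_from_tests(src, TESTS) for src in CANDIDATES]
best = CANDIDATES[rewards.index(max(rewards))]
print(rewards, "->", best)
```

No new human-written examples are involved here; the tests alone decide which attempt gets reinforced.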

fragmede 312 days ago

Fundamentally the bottleneck is on data and compute. If we accept as a given that a) some LLM is bad at writing e.g. Rust code because there's much less of it on the Internet compared to, say, React JS code, but that b) the LLM is able to generate valid Rust code, and c) the LLM is able to "tool use" the Rust compiler and a runtime to validate the Rust it generates, and iterate until the code is valid, and finally d) use that generated Rust code to train on, then it seems that, barring any algorithmic improvements in training, the additional data should allow later versions of the LLM to be better at writing Rust code. If you don't hold a-d to be possible then sure, maybe it's just AI CEOs talking their book.
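Steps b) through d) can be sketched as a generate, validate, retry loop that only keeps samples the compiler accepts. Everything here is an assumption for illustration: `toy_model` is a hypothetical stand-in for an LLM call, and Python's built-in `compile()` stands in for the Rust compiler so the example runs on its own.

```python
def toy_model(prompt, feedback=None):
    """Hypothetical stand-in for an LLM call: the first attempt is broken,
    the retry (given compiler feedback) is valid."""
    if feedback is None:
        return "def square(x) return x * x"  # missing colon
    return "def square(x):\n    return x * x"


def check_compiles(source):
    """Stand-in for invoking a compiler: return an error message, or None if OK."""
    try:
        compile(source, "<generated>", "exec")
        return None
    except SyntaxError as err:
        return str(err)


def generate_valid_sample(prompt, max_attempts=3):
    """Iterate until the generated code compiles, feeding errors back in."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        source = toy_model(prompt, feedback)
        feedback = check_compiles(source)
        if feedback is None:
            return {"prompt": prompt, "completion": source, "attempts": attempt}
    return None  # never validated: keep it out of the training data


# Step d): only validated samples are collected for later training.
training_set = []
sample = generate_valid_sample("write a function that squares a number")
if sample is not None:
    training_set.append(sample)
print(training_set)
```

A real pipeline would presumably also run the code, not just compile it, since compiling only shows the code is valid, not that it does what the prompt asked.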

The other fundamental bottleneck is compute. Moore's law hasn't gone away, so if the LLM was GPT-3, and used 1 supercomputer's worth of compute for 3 months back in 2022, and the supercomputer used for training is, say, three times more powerful (3x faster CPU and 3x the RAM), then training on a latest generation supercomputer should lead to a more powerful LLM simply by virtue of scaling that up and no algorithmic changes. The exact nature of the improvement isn't easily back-of-the-envelope calculable, but even with a layman's understanding of how these things work, that doesn't seem like an unreasonable assumption on how things will go, and not "AI CEOs talking their book". Simply running with a bigger context window should allow the LLM to be more useful.
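For a very rough feel of what 3x compute buys, here is a back-of-the-envelope sketch. It leans on two assumptions that are not from this comment: the commonly cited approximation that training compute is about 6 x parameters x tokens, and the rule of thumb that compute-optimal training grows parameters and tokens in roughly equal proportion. The function name `compute_optimal_growth` is made up for illustration.

```python
import math


def compute_optimal_growth(compute_multiplier):
    """Assuming C ~ 6 * N * D with N and D grown in equal proportion,
    both scale with the square root of the compute multiplier."""
    growth = math.sqrt(compute_multiplier)
    return growth, growth


param_growth, token_growth = compute_optimal_growth(3.0)
print(f"3x compute -> ~{param_growth:.2f}x parameters and ~{token_growth:.2f}x tokens")
```

This says nothing about how much more capable the resulting model would be, only how the extra budget might be split.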

Finally though, why do you assume that, absent papers up on arXiv, there haven't been and won't be any algorithmic improvements to training and inference? We've already seen how allowing the LLM to take longer to process the input (e.g. "ultrathink" to Claude) allows for better results. It seems unlikely that all possible algorithmic improvements have already been discovered and implemented. That OpenAI et al. aren't writing academic papers to share their discoveries with the world, and are instead keeping those improvements private and proprietary to try and gain a competitive edge in a very competitive business, seems like a far more reasonable assumption. With literal billions of dollars on the line, would you spend your time writing a paper, or would you try and outcompete your competitors? If simply giving the LLM longer to process the input before user-facing output is returned helps, what other algorithmic improvements on the inference side, on a bigger supercomputer with more RAM available to it, are possible? DeepSeek seems to say there's a ton of optimization still to be done.

Happy to hear opposing points of view, but I don't think any of the things I've theorized here are totally inconceivable. Of course there's a discussion to be had about diminishing returns, but we'd need a far deeper understanding of the state of the art on all three facets I raised in order to have an in-depth and practical discussion on the subject. (Which tbc I'm open to hearing, though the comments section on HN is probably not the platform to gain said deeper understanding of the subject at hand).

ACCount36 312 days ago

We are nowhere near the best learning sample efficiency possible.

Unlocking better sample efficiency is algorithmically hard and computationally expensive (with known methods) - but if new high quality data becomes more expensive and compute becomes cheaper, expect that to come into play heavily.

"Produce plausible text" is by itself an "AGI complete" task. "Text" is an incredibly rich modality, and "plausible" requires capturing a lot of knowledge and reasoning. If an AI could complete this task to perfection, it would have to be an AGI by necessity.

We're nowhere near that "perfection" - but close enough for LLMs to adopt and apply many, many thinking patterns that were once exclusive to humans.

Certainly enough of them that sufficiently scaffolded and constrained LLMs can already explore solution spaces, and find new solutions that eluded both previous generations of algorithms and humans, e.g. AlphaEvolve.

dvfjsdhgfv 312 days ago

I don't think anybody argues there will be no progress. We just disagree about the shape of the curve.
