by lbrito 149 days ago
>If the argument is sustainability of training, I'm skeptical we need these payment models.

That seems to be the argument: LLM adoption leads to a drop in organic training data, causing LLMs to eventually plateau, and we'll be left without the user-generated content we relied on for a while (like SO) and with subpar LLMs. That's what I'm getting from the article anyway.


There are so many things wrong with the points this article repeats, but those are soundbites at this point so I'm not sure one can even argue against them anymore.

Still, as for the one about organic data (or "pre-war steel") drying out: it's not a threat to model development at all. People repeating this point don't realize that we already have way more data than we need. We got to where we are by brute-forcing the problem - throwing more data at a simple training process. If new "pristine" data were to stop flowing now, we'd still have a) decent pre-trained base models, and a dataset that's more than sufficient to train more of them, and b) lots of low-hanging fruit to pick in training approaches, architectures and data curation, which will allow us to get more performance out of the same base data.

That, and the fact that synthetic data turned out to be quite effective after all, especially in the later phases of training. No surprise there; for many classes of problems this is how we learn as well. Anyone who has studied math for a maturity exam or university entrance exams knows this: the best way to learn is to solve lots of variations of the same set of problems. These variations are all synthetic data, until recently generated by hand, but even their trivial nature doesn't make them less effective at teaching.
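To make the "variations of the same problem" idea concrete, here's a rough sketch of generating that kind of exercise set programmatically. The problem template (`a*x + b = c`), the number ranges and the function name `make_variation` are all made up for illustration; this isn't how any particular lab produces synthetic training data.

```python
import random


def make_variation(rng: random.Random) -> dict:
    """Build one variation of 'solve a*x + b = c' with an integer answer.

    Hypothetical template: the ranges below are arbitrary choices.
    """
    a = rng.randint(2, 9)      # non-zero coefficient, so x is well defined
    x = rng.randint(-10, 10)   # pick the answer first...
    b = rng.randint(-20, 20)
    c = a * x + b              # ...then derive the right-hand side from it
    sign = "+" if b >= 0 else "-"
    question = f"Solve for x: {a}x {sign} {abs(b)} = {c}"
    return {"a": a, "b": b, "c": c, "question": question, "answer": x}


rng = random.Random(0)  # seeded so the set is reproducible
problems = [make_variation(rng) for _ in range(5)]

for p in problems:
    print(p["question"], "->", p["answer"])
```

Each problem is trivial on its own, but the answer is known by construction, which is exactly what makes this kind of data cheap to generate and check.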

>We got to where we are by brute-forcing the problem

This has been a bit of a concern of mine: that we have to do things the hard way for a long time, and in doing so build a massive amount of fast hardware. Then, if we get some breakthrough that massively drops the amount of compute necessary, the surplus we suddenly have may lead to some kind of AI capability explosion.

The article gets the part about organic data dying off right. Look at Google SERPs for an example. Almost nobody clicks through to the source anymore, so ad revenue is drying up for publishers, and people are publishing less or publishing in places that pay them directly and sit behind a paywall, like Medium. Which means Google has less data to work with.

That said, what it misses is that the AI prompts themselves become a giant source of data. None of these companies are promising not to use your data, and even if you don't opt in, the person you sent the document/email/whatever to will, because they want it paraphrased or need help understanding it.

>AI prompts themselves become a giant source of data.

Good point, but can it match the old organic data? I'm skeptical. For one, the LLM environment lacks any truth or consensus mechanism that the old SO-like sites had. 100s of users might have discussed the same/similar technical problem with an LLM, but there's no way (afaik) for the AI to promote good content and demote bad ones, as it (AI) doesn't have the concept of correctness/truth. Also, the old sites were two-sided, with humans asking _and_ answering questions, while they are only on the asking side with AI.

> (AI) doesn't have the concept of correctness/truth

They kind of do, and it's getting better every day. We already have huge swaths of verifiable facts available to them to ground their statements in truth. Work on Cyc started in 1984, and Wikipedia just signed deals with all the major players.

The problem you're describing isn't intractable, so it's fairly certain that someone will solve it soon. Most of the brightest minds in society are working on AI in some form now. It's starting to sound trite, but today's AIs really are the worst that AI will ever be.

> Most of the brightest minds in society are working on AI in some form now.

Source? I haven’t met one intelligent person working on AI. The smartest people are being ground into dust. They’re being replaced by pompous overconfident people such as yourself.

> I haven’t met one intelligent person working on AI.

I get the impression that you don't meet a lot of people in general.

> 100s of users might have discussed the same/similar technical problem with an LLM, but there's no way (afaik) for the AI to promote good content and demote bad ones, as it (AI) doesn't have the concept of correctness/truth

The LLM doesn't, but reinforcement does. If someone keeps asking the model how to fix the problem after being given an answer, the answer is likely wrong. If someone deletes the chat after getting the answer, it was probably right.
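As a rough sketch of what that signal could look like in code (the `Chat` fields, the threshold of two follow-ups and the reward values are all assumptions for illustration, not anything a vendor has documented):

```python
from dataclasses import dataclass


@dataclass
class Chat:
    # Hypothetical fields: how often the user re-asked about the same
    # problem after the answer, and whether they deleted the chat afterwards.
    follow_ups_on_same_problem: int
    deleted_after_answer: bool


def implicit_reward(chat: Chat) -> float:
    """Guess a reward from user behaviour alone; thresholds are made up."""
    if chat.follow_ups_on_same_problem >= 2:
        return -1.0  # kept asking: the answer probably didn't work
    if chat.deleted_after_answer:
        return 1.0   # left right after the answer: probably solved
    return 0.0       # no clear signal either way


chats = [Chat(3, False), Chat(0, True), Chat(1, False)]
rewards = [implicit_reward(c) for c in chats]
print(rewards)
```

Obviously noisy (people also delete chats out of frustration), but aggregated over many users asking about the same problem it gives a ranking signal that doesn't need the model itself to know what's true.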

AI is an entropy machine.

Those AI prompts that become data for the AI companies are yet another thing that human creators used to rely on to understand what people wanted: topics to explore, feedback on what they hadn't communicated well enough. That 'value' is AI stealing yet more energy from the system, resulting in even less, and less valuable, human creation.