by mrcwinn 149 days ago
Pay per crawl of StackOverflow wouldn't encourage me to post more on StackOverflow. (Not that I was anyway.) Presumably you'd need to pay content creators, but that seems quite inefficient:

1. I pay OpenAI
2. OpenAI rev shares to StackOverflow
3. StackOverflow mostly keeps that money, but shares some with me for posting
4. I get some money back to help pay OpenAI?

This is nonsense. And if the frontier labs are right about simulated data, as Tesla seems to have been right with its FSD simulated visualization stack, does this really matter anyway? The value I get from an LLM far exceeds anything I have ever received from SO or an O'Reilly book (as much as I genuinely enjoy them collecting dust on a shelf).

If the argument is "fairness," I can sympathize but then shrug. If the argument is sustainability of training, I'm skeptical we need these payment models. And if the argument is about total value creation, I just don't buy it at all.


>If the argument is sustainability of training, I'm skeptical we need these payment models.

That seems to be the argument: LLM adoption leads to a drop in organic training data, leading LLMs to eventually plateau, and we'll be left without the user-generated content we relied on for a while (like SO) and with subpar LLMs. That's what I'm getting from the article anyway.

There are so many things wrong with the points this article repeats, but those are soundbites at this point so I'm not sure one can even argue against them anymore.

Still, as for the one about organic data (or "pre-war steel") drying up, it's not a threat to model development at all. People repeating this point don't realize that we already have way more data than we need. We got to where we are by brute-forcing the problem: throwing more data at a simple training process. If new "pristine" data were to stop flowing now, we would still have a) decent pre-trained base models, and a dataset that's more than sufficient to train more of them, and b) lots of low-hanging fruit to pick in training approaches, architectures and data curation, which will let us get more performance out of the same base data.

That, and the fact that synthetic data turned out to be quite effective after all, especially in the latter phases of training. No surprise there, for many classes of problems this is how we learn as well. Anyone who has experience studying math for maturity exam / university entry exams knows this: the best way to learn is to solve lots of variations of the same set of problems. These variations are all synthetic data, until recently generated by hand, but even their trivial nature doesn't make them less effective at teaching.

>We got to where we are by brute-forcing the problem

This has been a bit of a concern of mine: that we have to do things the hard way for a long time, and in doing so make a massive amount of fast hardware. Then, if we get some breakthrough that massively drops the amount of compute necessary, the surplus we suddenly have may lead to some kind of AI capability explosion.

The article gets the part about organic data dying off right. Look at Google SERPs for an example. Almost nobody clicks through to the source anymore, so ad revenue is drying up for those sources, and people are publishing less or publishing in places that pay them directly and live behind a paywall, like Medium. Which means Google has less data to work with.

That said, what it misses is that the AI prompts themselves become a giant source of data. None of these companies are promising not to use your data, and even if you don't opt in, the person you sent the document/email/whatever to will, because they want it paraphrased or need help understanding it.

>AI prompts themselves become a giant source of data.

Good point, but can it match the old organic data? I'm skeptical. For one, the LLM environment lacks any truth or consensus mechanism like the old SO-like sites had. 100s of users might have discussed the same/similar technical problem with an LLM, but there's no way (afaik) for the AI to promote good content and demote bad content, as it (AI) doesn't have the concept of correctness/truth. Also, the old sites were two-sided, with humans asking _and_ answering questions, while with AI they are only on the asking side.

> (AI) doesn't have the concept of correctness/truth

They kind of do, and it's getting better every day. We already have huge swaths of verifiable facts available to them to ground their statements in truth. They started building Cyc in 1984, and Wikipedia just signed deals with all the major players.

The problem you're describing isn't intractable, so it's fairly certain that someone will solve it soon. Most of the brightest minds in society are working on AI in some form now. It's starting to sound trite, but today's AIs really are the worst that AI will ever be.

“Most of the brightest minds in society are working on AI in some form now.”

Source? I haven’t met one intelligent person working on AI. The smartest people are being ground into dust. They’re being replaced by pompous overconfident people such as yourself.

> I haven’t met one intelligent person working on AI.

I get the impression that you don't meet a lot of people in general.

> 100s of users might have discussed the same/similar technical problem with an LLM, but there's no way (afaik) for the AI to promote good content and demote bad ones, as it (AI) doesn't have the concept of correctness/truth

The LLM doesn't, but reinforcement does. If someone keeps asking the model how to fix the problem after being given an answer, the answer is likely wrong. If someone deletes the chat after getting the answer, it was probably right.
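
To make that heuristic concrete, here is a rough sketch of how such implicit signals could be turned into labels. Everything in it is an assumption for illustration: the `Turn` type, the `implicit_label` function, the `deleted_after_answer` flag and the follow-up threshold are hypothetical, and nothing here describes how any actual provider processes chats.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str  # "user" or "assistant" (hypothetical schema)
    text: str


def implicit_label(
    turns: List[Turn],
    deleted_after_answer: bool,
    followup_threshold: int = 2,  # assumed cutoff, not an empirical value
) -> str:
    """Guess answer quality from user behaviour after the first answer."""
    # Find the first assistant answer in the conversation.
    first_answer = next(
        (i for i, t in enumerate(turns) if t.role == "assistant"), None
    )
    if first_answer is None:
        return "unknown"

    # Count how many times the user came back after that answer.
    followups = sum(1 for t in turns[first_answer + 1:] if t.role == "user")

    if followups >= followup_threshold:
        return "likely_wrong"  # user kept asking how to fix the problem
    if followups == 0 and deleted_after_answer:
        return "likely_right"  # user took the answer and left
    return "unknown"
```

Labels like these would be noisy (people also delete chats out of frustration), so at best they would serve as a weak reward signal rather than a ground-truth judgement.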

AI is an entropy machine.

Those AI prompts that become data for the AI companies are yet another thing that human creators used to rely on to understand what people wanted, which topics to explore, and where they hadn't communicated well enough. Capturing that 'value' is AI stealing yet more energy from the system, resulting in even less, and less valuable, human creation.

> If the argument is sustainability of training,

that is the argument, yes.

Claude clearly got an enormous amount of its content from StackOverflow, which has mostly ceased to be a source of new content. However, unlike the author, I don't see any way to fix this; StackOverflow was only there because people had technical questions that needed answers.

Maybe if the LLMs do indeed start going stale because there's not enough training data for new technologies, Q&A sites like StackOverflow would still have a place, since people would resort to asking each other questions rather than LLMs that don't have training data for a newer technology.