by adamtaylor_13 13 days ago
The argument that LLMs are "feeding me back free and open internet" seems to skip the most useful aspect of the tool.

I could never, as an individual, read, let alone synthesize and make decisions with, the amount of information on the internet. The LLM takes that free and open information and feeds me back novel information based on that free information. It gives me ideas, opinions, and hard data based on that information.

It's the most powerful information synthesizing tool in existence. I don't find the argument that "it's built on free information and sold to you" fair or plausible at all.

It's like saying you're free to make your own bottled water. Technically true, but in reality not.

3 comments

I think you undervalue the contribution of internet-scale data to foundation modeling, and because LLMs can obsolete the content they required, I think it's fair to characterize it as theft. Obviously RL contributes a lot to capabilities, but the judgement that an LLM uses to 'synthesize information' is born from the training data. The scale of the data really is beyond intuition. books3, for example, would take 230 yrs of continuous reading

I actually think the "proprietary non-deterministic database of the free internet" does a lot to characterize the capabilities and effects for a lot of people. Obviously coders are more in tune with how well agents can work, but that's also due more to the RL breakthroughs than foundation modeling.

As I understand it, RL makes foundation models stupider (less capable, not more) but better at following instructions.
Can you steal something that is free and openly available?

I just don't understand this argument. "Theft" feels like a nice, heavy, moral accusation to toss at those you're debating with, but the actual prerequisites for theft don't even exist in this situation.

It is a lot more complicated than that. Your content is not simply used, copied, or even just simply distributed. The very terrain on which you produce, distribute, and represent your content has shifted due to the mechanics of it. Anything you produce is grabbed into AI summaries. It's grabbed into the training data. Humans produce free/open materials for many reasons. A lot of them don't have room to breathe and gain structure due to AI siphoning the entire atmosphere of the web; e.g. communities
I mean, not that I'm a huge fan of IP laws, but yes?

Like I said, if you provide an alternative to all these blogs and forums (because you trained on them or because you scrape them for RAG) then you are stealing their traffic. Search engines were/are already doing that, but the foundation training

It’s the solution to the second information problem. Hypertext arose from Bush’s Memex, and the information problem it offered to solve. Now, there is simply so much information available on the modern Memex that it is impossible to make any sense of it all. So, we now have LLMs. There are still some issues with them, but they’re good at what they do.

I have mixed emotions about LLMs and AI more generally. I fear the dystopia, hope for some marginal improvement in human life, and I genuinely enjoy playing around with local models. But I think there may be near-term harms that outweigh the gains. We shall see.

Nothing about the information it feeds you is novel. It's all stolen repetition of someone else's work.
Bizarre to say that. When I have it perform work on a bespoke code base on a niche videogame, in a less commonly used language, is that still "regurgitating stuff"?

No, it is impossible for it to have seen this combination of things.

It routinely produces, suggests, and correctly implements novel things that had not existed.

You can see this yourself by learning how LLMs work, or anecdotally using these tools.

LLMs are terrible at generating code for “less commonly used languages”. They require LOTS of data for high accuracy.

I describe it this way: they are good at interpolating from what data they were trained on, but terrible at extrapolating. I agree with the parent that the LLM-generated content isn’t novel, it’s just a rehash of two things it was trained on.

I have wasted quite a number of hours trying to use LLMs to write things for less common languages. Sure they can one-shot some impressive stuff in C#, Python, and JavaScript… but try working in Object Pascal: it’s non-obvious hallucination after non-obvious hallucination, presented confidently enough to make it difficult to see it’s complete garbage, so you waste a ton of time trying to polish a turd.
yet i’ve written a language using an LLM, of which there can be no prior knowledge since it’s new, and it can write that code just fine.

it’s all about context.

Creating a new paradigm is a problem with a lot more groundwork laid than working in an existing little-known paradigm. One is creating patterns which only have to be good to be correct. The other has to be correct to be good. They are completely different problems.
That is simply not true. The naive “glorified auto-complete / stochastic parrot” argument may have some merit when applied to generic pre-trained models, which only learn from unsupervised next-token prediction. But the post-training through reinforcement learning that the frontier models undergo is very sophisticated, and they genuinely learn to do novel things that are purely the work of the model being trained (and the work of the GPUs they burn along the way, of course).
Thank god I bought the alphabet before learning it unlike one of those stealing heathens.

In your hate of AI please don't build the world in The Right to Read.

I'm certain I've read this comment before.