Hacker News
by dijksterhuis 527 days ago
as a human being, and one that does music stuff, i don’t download terabytes of other people’s works from the internet directly into my brain. i don’t have verbatim reproductions of people’s work sitting around on a hard disk in my stomach/lungs/head/feet.

LLMs are not humans. They’re essentially a probabilistic compression algorithm (encode data into model weights/decode with prompt to retrieve data).


Do you ever listen to music? Is your music ever influenced by the music that you listen to? How do you imagine that works, in an information-theoretical sense, that fundamentally differs from an LLM?

Depending on how much music you've listened to, you very well may have "downloaded terabytes" of it into your brain. Your argument is specious.

Information on how large language models are trained is not hard to come by; there are numerous articles that cover this material. Even a brief skim of it will make clear that the training of large language models is materially different in almost every way from how human beings "learn" and build knowledge. There are still many open questions around the process of how humans collect, store, retrieve and synthesize information.

There is little mystery to how large language models function, and it's clear that their output is parroting back portions of their training data; the quality of output degrades greatly when novel input is provided. Is your argument that people fundamentally function in the same way? That would be a bold and novel assertion!

> There is little mystery to how large language models function and it's clear that their output is parroting back portions of their training data

If this were true, then you would be able to identify the specific work being "parroted" and you'd have a case for copyright infringement regardless of whether it was produced by an LLM at all. This isn't how LLMs work though. For instance, if an LLM's training data includes the complete works of a given author and then you prompt the LLM to write a story in the style of that author, it will actually write an original story instead of reproducing one of the stories in its training corpus. It won't be particularly good but it will be an original work.

It also isn't obvious whether or not, or to what degree, LLM training works differently from human learning. You yourself acknowledged that there are "many open questions" about how human learning works, so how can you be so confident that it's fundamentally different? It doesn't matter anyway, because you can judge whether LLM output infringes copyright by the exact same standards you would apply to something produced by a human being.

i do listen to music.

i listen to it on apple music.

i pay money to apple for this.

some of that money that i pay to apple goes to the rights holders of that music for the copying and performance of their work through my speakers.

that’s a pretty big difference to how most LLMs are trained right there! i actually pay original creators some money.

-

i am a human being. you cannot reduce me down to some easy information theory.

an LLM is a tool. an algorithm. with the same random seed etc etc it will get the same results. it is not human.

put me in the same room as yesterday and i’ll behave completely differently.
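
to make the seed point concrete, here's a minimal sketch in python. it is a toy and not a real LLM: `TOY_MODEL`, `sample` and the token table are all made up for illustration. the only thing it shows is that sampling from a fixed distribution with a fixed seed gives the same sequence on every run.

```python
import random

# Hypothetical toy "model": a fixed table of next-token probabilities.
# Nothing here resembles a real LLM beyond "sample the next token
# from a probability distribution".
TOY_MODEL = {
    "the": [("cat", 0.5), ("dog", 0.3), ("song", 0.2)],
    "cat": [("sat", 0.6), ("sang", 0.4)],
    "dog": [("sat", 0.5), ("ran", 0.5)],
    "song": [("played", 1.0)],
}


def sample(seed, start="the", length=5):
    """Generate up to `length` tokens after `start`, using a seeded RNG."""
    rng = random.Random(seed)  # private RNG, so global state can't interfere
    out = [start]
    for _ in range(length):
        options = TOY_MODEL.get(out[-1])
        if options is None:  # no known continuation: stop early
            break
        tokens, weights = zip(*options)
        out.append(rng.choices(tokens, weights=weights, k=1)[0])
    return out


if __name__ == "__main__":
    print(sample(seed=42))
    print(sample(seed=42))  # same seed, same table: identical to the line above
    print(sample(seed=7))   # a different seed may or may not give a different result
```

the "etc etc" does matter in practice: real inference stacks can still vary between runs for reasons like floating point ordering on GPUs, so a fixed seed on its own isn't always enough.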

-

i have listened to way more than terabytes of music in my life. doesn’t mean i have the ability to regurgitate any of it verbatim though. i’m crap at that stuff.

LLMs seem to be really good at it though.