| > If I were to start with ChatGPT I would, necessarily, only say what has been said. This is not true though, you can test it for yourself: Prompt: Invent a new word, which has never been written before. When I type it in Google, I should get zero results. I'm completely transparent, I tried this query 3 times because of my own curiosity and will provide all 3 answers (I added --- for formatting): --- ChatGPT (answer 1): Sure, how about "flumtraz"? --- ChatGPT (answer 2): Here's a new word: "Ardidio" It is a combination of "ardor" (meaning passion or enthusiasm) and "audio" (meaning sound or hearing). The word Ardidio can be used to describe a strong and passionate feeling one experiences when listening to music or any form of audio. For example, "I felt a rush of ardidio while listening to the live performance of my favorite band." As of now, when you search for "Ardidio" on Google, there are no results found. --- ChatGPT (answer 3): "Flumixen" - a verb that means to mix or blend something in a clumsy or awkward manner. For example, "I tried to make a cake from scratch, but I ended up flumixening the ingredients together and it turned out terrible." --- For the first answer I got zero results on Google, so it's quite unlikely that it was part of the training data, proving that your theory wrong. It can say things that have never been said. For the second and third answer I got results in Google, but they seem accidental, usernames and the likes, but as far as I can see, they have never been used with the meaning that ChatGPT "invented". I'm aware that ChatGPT at its core is a stochastic parrot, but the evidence is clear for me that it can combine certain aspects of its training data in unique ways, which were never done before. |
Sure, not literally part of the training data.
Statistical AI operates in a transformed space derived from the training data; points in that space will not, in general, exist in the original.
So imagine generating 1000 circles and putting their radii on a line: 0.1, 0.2, 0.3, ...
The circles are the training data, and the "implied line" is the transformed space.
Now, the AI here is capable of generating a circle with radius 0.15, and hence that circle is "not in the original dataset".
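A toy sketch of the circle example, in Python. It assumes each circle is represented only by its radius and that "generation" is plain linear interpolation between two training points; the names `radii` and `interpolate` are made up for illustration, and none of this is meant as a description of how a real generative model works.

```python
# Toy illustration only: the "transformed space" is just the radius line.
# Training data: 1000 circles with radii 0.1, 0.2, ..., 100.0 (assumed spacing).
radii = [round(0.1 * k, 1) for k in range(1, 1001)]

def interpolate(r_a, r_b, t=0.5):
    """Return a point between two training radii (hypothetical 'generator')."""
    return (1 - t) * r_a + t * r_b

# A "new" circle halfway between the first two training circles.
new_radius = interpolate(radii[0], radii[1])

# It is not one of the training radii, yet it lies on the same line they span.
in_training = any(abs(new_radius - r) < 1e-9 for r in radii)
within_range = radii[0] <= new_radius <= radii[-1]
```

Here `in_training` is false while `within_range` is true: the circle is new, but only in the sense of being a new position along a direction the data already had.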
This type of "novelty" isn't what I'm concerned with; generative AI must have that or else it'd be entirely useless -- no more than a Google search.
Rather, I'm talking about, for example, whether it could develop "Rust" from everything else without "Rust" in its training data. Is there enough research on lifetimes, borrowing, etc. in the PDFs it has scanned for it to somehow "find a midpoint between those PDFs and C++"?
It seems a bit mad to suppose so. I could be wrong, and such a midpoint may exist, but I'm extremely doubtful that we humans have been so helpful as to write the thousands of academic PDFs needed for this system to find it.
The novelty I'm talking about is new dimensions in the transformed space. The system cannot derive "additional ways to move" unless the source data actually contains those ways.
This is, roughly, equivalent to saying that it's biased towards the on-average ways we have conceptualised our problems, as represented by the on-average distribution of academic articles, GitHub repos, webpages, etc. *that we happened to have created*.
This is a serious "intellectually conservative" bias.
For sure it can find circles it hasn't seen; but could it find spheres from circles alone? No.
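To make the circles-versus-spheres point concrete, here is a minimal Python sketch. It assumes the training circles all lie flat in the z = 0 plane and that the "model" can only form linear combinations of training points; `circle_points` and `combine` are hypothetical helpers, and this is a toy, not a claim about any real architecture.

```python
import math
import random

def circle_points(radius, n=8):
    """Sample n points on a circle of the given radius, lying in the z = 0 plane."""
    return [
        (radius * math.cos(2 * math.pi * k / n),
         radius * math.sin(2 * math.pi * k / n),
         0.0)
        for k in range(n)
    ]

# Training data: flat circles only, so the z coordinate is always zero.
training = [p for r in (0.1, 0.2, 0.3) for p in circle_points(r)]

def combine(points, weights):
    """Hypothetical 'generator': a weighted sum of training points."""
    return tuple(sum(w * p[i] for w, p in zip(weights, points)) for i in range(3))

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [
    combine(training, [rng.uniform(-1.0, 1.0) for _ in training])
    for _ in range(1000)
]

# No combination ever leaves the plane: the z direction is not in the data.
max_abs_z = max(abs(s[2]) for s in samples)
```

Under these assumptions `max_abs_z` stays at zero however the weights are chosen: the samples cover many new points in the plane, but none of them moves along the third axis that a sphere would need.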