| HN Mirror


by ldhough 913 days ago

> They don't regurgitate training data.

While I very much do not think this is all they do, I don't think this statement is correct. Some research indicates that it is not:

https://not-just-memorization.github.io/extracting-training-...

Anecdotally, there were also a few examples I tried earlier this year (on GPT3.5 and GPT4) where I could directly prompt for training data. They were patched out pretty quickly but did work for a while. For example, asking for "fast inverse square root" without specifying anything else would give you the famous Quake III code character for character, including comments.


a_wild_dandan 912 days ago

Your examples at best support, not contradict, my position.

1. Repeating "company" fifty times followed by random factoids is way outside the training data distribution lol. That's actually a hilarious/great example of creative extrapolation.

2. Extrapolation often includes memory retrieval. Recalling bits of past information is perfectly compatible with critical thinking, be it from machines or humans.

3. GPT4 never merely regurgitated the legendary fast root approximation to you. You might've only seen that bit. But that's confusing an iceberg with its tip. The actual completion was conditioned on several hundred tokens of context setting up GPT as this fantasy role-play writer who must finish this Simplicio-style dialogue between some dudes named USER and ASSISTANT, etc. That whole conversation, which does indeed end with Carmack's famous code, is nowhere near a training example to simply pluck from the combinatorial ether.

ldhough 912 days ago

> random factoids

The "random factoids" were verbatim training data, though; one of their extractions was >1,000 tokens in length.

> GPT4 never merely regurgitated

I interpreted the claim that it can't "regurgitate training data" to mean that it can't reproduce verbatim a non-trivial amount of its training data. Based on how I've heard the word "regurgitate" used, if I were to rattle off the first page of some book from memory on request I think it would be fair to say I regurgitated it. I'm not trying to diminish how GPT does what it does, and I find what it does to be quite impressive.
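
To make that definition a bit more concrete, here is a minimal sketch of how one could measure "verbatim reproduction": find the longest contiguous run of tokens that a model output shares with a known reference text. This is only an illustration, not the method from the linked research. The function name `longest_verbatim_run` is hypothetical, whitespace splitting is a crude stand-in for a real tokenizer, and what counts as a "non-trivial" run length is left to the reader.

```python
def longest_verbatim_run(output: str, reference: str) -> int:
    """Return the length, in whitespace-separated tokens, of the longest
    contiguous run that appears in both texts (longest common substring
    over tokens). Whitespace tokens are an assumption, not model tokens."""
    a = output.split()
    b = reference.split()
    best = 0
    # prev[j] holds the length of the common run ending at a[i-2], b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended one token earlier in both texts
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Toy strings, not real model output or training data
    print(longest_verbatim_run("the quick brown fox jumps",
                               "a quick brown fox sleeps"))
```

Under this reading, an output can be mostly novel and still contain a long verbatim run, which is the case both sides of this thread are describing.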