In another adjacent thread, people are talking about the implications of a neural network conforming to the training data with some error margin with regards to copyright.
Many textbooks on information theory already call out the content-addressable nature of such networks[1], and they’re even used in applications like compression due to this purpose[2][3], and therefore it’s no surprise that the NYT prompting OpenAI models with a few paragraphs of their articles reproduced them nearly verbatim.
Yes! This is a consequence of empirical risk minimization via maximum likelihood estimation. To have a model not reproduce the density of data it trained on would be like trying to get a horse and buggy to work well at speed, "now just without the wheels this time". It would generally not necessarily go all that well, I think! :'D