"The study authors took 36 books and divided each of them into overlapping 100-token passages. Using the first 50 tokens as a prompt, they calculated the probability that the next 50 tokens would be identical to the original passage. They counted a passage as “memorized” if the model had a greater than 50 percent chance of reproducing it word for word."
So they fed in "It takes a great deal of bravery to stand up to our " and the LLM responded "enemies, but just as much to stand up to our friends".
They repeated that for every 100-token passage across the entire book. I think lots of fans could do just as well. It's pretty good evidence that the Potter books were in the training corpus, but it's not quite what people have in mind when they say an LLM has 'memorized' something. It's not like getting even a few pages out of the model.
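For anyone who wants to see the mechanics, here is a rough sketch of the test as the article describes it. It is not the study's code: the function names and the `stride` value are my own assumptions (the article only says the passages overlap), and the per-token log-probabilities are assumed to come from whatever model you are scoring.

```python
import math

def overlapping_passages(tokens, length=100, stride=10):
    """Yield (prompt, target) pairs from overlapping windows.

    `stride` is a hypothetical value; the article does not give one.
    The first half of each window is the prompt, the second half the target.
    """
    half = length // 2
    for start in range(0, len(tokens) - length + 1, stride):
        window = tokens[start:start + length]
        yield window[:half], window[half:]

def continuation_probability(token_logprobs):
    """Probability of reproducing the whole target verbatim.

    `token_logprobs` holds the model's log-probability for each target
    token given the prompt and the preceding target tokens; the joint
    probability is the product, i.e. exp of the summed logs.
    """
    return math.exp(sum(token_logprobs))

def is_memorized(token_logprobs, threshold=0.5):
    """Apply the article's cutoff: more than a 50 percent chance."""
    return continuation_probability(token_logprobs) > threshold
```

Note that the probability is a product over all 50 target tokens, so clearing the 50 percent bar needs an average per-token probability of roughly 0.986 (that is `0.5 ** (1 / 50)`).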
Genome analysis is also a lossy process that chops the data up into tiny bits, like a newspaper sent through a shredder. We then piece together matching sequences in a sort of puzzle. The result is often a relatively inaccurate solution, so we do it again with a different copy of the newspaper sent through a different shredder. And again. A genome might be assembled from 10x, 30x, or 100x reads, with more replications representing higher confidence.
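The "10x" or "30x" figure is usually called coverage or depth: on average, how many shredded copies land on each position. A back-of-the-envelope sketch follows; the function name and the example numbers are made up for illustration.

```python
def mean_coverage(num_reads, read_length, genome_length):
    """Average number of reads covering each base.

    Standard estimate: total sequenced bases divided by genome size.
    """
    return num_reads * read_length / genome_length

# Hypothetical run: 6 million reads of 150 bases over a 30-million-base genome.
depth = mean_coverage(6_000_000, 150, 30_000_000)
```

This is only an average. Real runs cover some regions far more deeply than others, which is part of why more copies of the newspaper help.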
There might be ten million people who have quoted Harry Potter at some point in their blogs or forum posts. There are only so many words in the books.
That issue is different. When web tools were added to GPT-4o, it would fetch the site and basically copy-paste the text into the answer body, so you were able to read the content of the site without the site getting the ad impressions. Now the system prompts put a very tight word limit (25?) on quotes from sites the model visits.
https://arstechnica.com/features/2025/06/study-metas-llama-3...