by craffel
2430 days ago
Hi, one of the paper authors here. Indeed this is a good question. A couple of comments:
- Common Crawl overall is a sparse web dump, so it is unlikely that the month we used includes any of the data that are in any of the test sets.
- In order for the data to be useful to our model, it would have to be in the correct preprocessed format ("mnli: hypothesis: ... premise: ...") with the label in a format our model could extract meaning from. We introduced this preprocessing format, so I don't believe this would ever happen.
- Further, most of these datasets live in .zip files. The Common Crawl dump doesn't unzip zip files.
- C4 is so large that our model sees each example (corresponding to a block of text from a website) roughly once ever over the entire course of training. Big neural nets trained with SGD are unlikely to memorize something if they only see it once over the course of one million training steps.
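To make the second point concrete, here is a rough sketch of what an example in that preprocessed format might look like, and why raw web text would not match it. Only the "mnli: hypothesis: ... premise: ..." layout comes from the comment above; the function names, the sample sentences, and the use of the plain string "entailment" as the label are assumptions for illustration, not the paper's actual code.

```python
from typing import Tuple

# Layout quoted in the comment above; the exact spacing is an assumption.
MNLI_PREFIX = "mnli: hypothesis: "
PREMISE_MARKER = " premise: "


def to_text_to_text(premise: str, hypothesis: str, label: str) -> Tuple[str, str]:
    """Hypothetical helper: build an (input, target) pair in the quoted layout."""
    source = f"{MNLI_PREFIX}{hypothesis}{PREMISE_MARKER}{premise}"
    return source, label


def looks_preprocessed(text: str) -> bool:
    """Hypothetical check: does a block of text already follow that layout?"""
    return text.startswith(MNLI_PREFIX) and PREMISE_MARKER in text


# Made-up example sentences, not taken from any dataset.
source, target = to_text_to_text(
    premise="A man is walking his dog in the park.",
    hypothesis="A person is outdoors.",
    label="entailment",  # assumed label string
)

# The same sentences as they might appear on an ordinary web page.
raw_web_text = "A man is walking his dog in the park. A person is outdoors."

print(source)
print(looks_preprocessed(source), looks_preprocessed(raw_web_text))
```

Under these assumptions, a web page that merely contains the premise and hypothesis sentences would not look like a training example for the task, since it lacks both the task prefix and the paired label.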
|
I am not so sure about that. Have you seen this thread? https://www.reddit.com/r/MachineLearning/comments/dfky70/dis...
Apparently lots of sentence fragments were memorized by GPT-2 (including real-world URLs, entire conversations, usernames/emails and other PII).