by craffel
2433 days ago
It can actually be more pernicious than that (https://arxiv.org/abs/1802.08232). However, note that the dataset used to train GPT-2 is about 20x smaller than C4. I'm not 100% sure how many times the training set was repeated over the course of training for GPT-2, but it was likely many times. I stand by my statement (that memorization is unlikely with SGD and no repetition of training data), but I would be happy to be proven otherwise.