It'll be interesting to see, a decade from now, how researchers collect a corpus that isn’t chock-full of models’ own output.