| Exactly. This was the type of mistake that OpenAI could easily have made. I could see myself including this historical dataset without giving it a second thought. After all, the more data, the better, right? One of The Pile's goals was to point out how tricky that can be. We've all seen how effortlessly Copilot spits out GPL code by rote; with the wrong sort of data, one wrong prompt is all it takes for a model to start spewing a lot of things that no one wants to hear. When you train with The Pile, you know exactly what you're getting, because you can take whatever parts you want and ignore the rest. It's a modular dataset. But defaults still matter: by default, everyone will train on everything. Maybe OpenAI trained on the wrong thing, and maybe that's why they're forcing everyone to use their filters now. With The Pile, people can "just go train on everything" and not have to worry. Once upon a time, the plan was to include a dump of Literotica (which you can still find here: https://the-eye.eu/public/AI/pile_preliminary_components/) in The Pile. I argued heavily in favor of this, and thought it was totally lame when they decided to drop it. In hindsight, that was a close call. AI Dungeon proves that it's easy to carelessly include things that can bite you later: https://gitgud.io/AuroraPurgatio/aurorapurgatio#aurorapurgat... Maybe some people want their models to include that sort of thing, but it shouldn't be the default. People shouldn't have to worry that the defaults will leave them asking, "Whoa, I only wanted to make a Q&A system for my business; why is it reciting love poems?" Stella saw that, I think. I didn't. |