Given that you can effectively identify and reformulate biased content (the lowest-effort method being to iterate with multiple revised prompts), I count it as a feature that the model contains a submodel of racist perspectives. If I were to ask you to compose a horribly offensive racist sentence, I am all but certain you could construct something utterly shocking. You yourself have a model of biased, sexist, and racist perspectives, and part of being a good human is recognizing it and using it as a proxy for what not to think, do, or say. If you're at all self-aware, you can check your thoughts against it and say, "Oh, that sounds like something a racist might say; let me reconsider whatever knowledge led me to think that way." We all do this, and these models are trained on more written content than any dozen humans have consumed in a lifetime, or even in a dozen lifetimes each. Removing the cruft, chaos, and noise might be valuable, but if you want a generally capable model that can parse a huge spectrum of human experience, that means taking the bad with the good. It's far more likely than not that the current state of the pile is not ideally curated, even with the intent of keeping the "bad" stuff, but I hope that becomes a consideration for its developers as they work on it further. There's a Nietzschean abyss aspect to these giant models: you don't want them trained significantly on the horrible and evil, but enough of it to foster a nuanced, deep understanding of human nature can't help but be valuable. A naive, ignorant, childlike model is less valuable than one aware of human nature at its worst. Fine-tuning on a 4chan dump might be highly effective for identifying trolling. Scraping neo-Nazi, extremist, and cult sites and forums could likewise let these models readily identify the content and contexts of those worldviews.
Prompt engineering is in its infancy, but it will let developers specify behavior semantically, in a way that can be traced and explained through each prompt down to the level of the model. Imagine a Twitter user attempting to post an argument and encountering a popup that says something like: "You're not allowed to post this. Unfortunately, it makes a straw-man argument that leaves the impression it could be reasonable to be biased against a person based on their race. Here is the exact reasoning: because blah, blah, blah." If the user challenges it, kick it to a human moderator, who can either recommend improving the prompts by pointing out where the reasoning went wrong or validate the preemptive moderation. None of that would be possible without training on biased and racist content. It does mean the tool must be used responsibly, with full awareness that you need to stay vigilant against bias, intentional or not. I'd rather have that than a naive, faux-innocent model that is unaware and less capable.