by cratermoon
1177 days ago
"The Colossal Clean Crawled Corpus, used to train a trillion parameter LM in [43], is cleaned, inter alia, by discarding any page containing one of a list of about 400 “Dirty, Naughty, Obscene or Otherwise Bad Words”. This list is overwhelmingly words related to sex, with a handful of racial slurs and words related to white supremacy (e.g. swastika, white power) included. While possibly effective at removing documents containing pornography (and the associated problematic stereotypes encoded in the language of such sites) and certain kinds of hate speech, this approach will also undoubtedly attenuate, by suppressing such words as twink, the influence of online spaces built by and for LGBTQ people. If we filter out the discourse of marginalized populations, we fail to provide training data that reclaims slurs and otherwise describes marginalized identities in a positive light"

From "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?": https://dl.acm.org/doi/10.1145/3442188.3445922

That list of words is https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and...
1. medical pages/docs using the medical terms anus, rectum, nipple, and semen (note that other medical terms are not on that list).
2. pages/docs using "sex" to refer to males and females.
3. pages/docs talking about rapeseed oil or the plant it comes from (https://en.wikipedia.org/wiki/Rapeseed_oil).
The big problem with these lists is that they exclude valid contexts and cover only a small set of possible terms, so the model gets a distorted view of the world (for example, learning that people can have penises, vaginas, and breasts, but not nipples or anuses, and that breasts cannot be big [1]). It would be better to train the models on these terms, teach them the contexts, and teach them which usages are archaic, outdated, old-fashioned, etc.
[1] But the list misses cases where "as big as", etc. are used to link the noun to the adjective, so just excluding the term "big breasts" is ineffective.
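To make the failure modes above concrete, here is a minimal Python sketch of page-level blocklist filtering. It is not the actual C4 cleaning code: the `BLOCKLIST` is a tiny hypothetical stand-in that contains only terms mentioned in this thread, and `blocked_terms`, `keep_page`, and the sample pages are names and data made up for illustration. The `whole_words` switch is likewise an assumption, included to show that plain substring matching also catches words like "rapeseed".

```python
import re

# Hypothetical stand-in for the real list: only terms mentioned in this thread.
BLOCKLIST = ["sex", "rape", "nipple", "anus", "semen", "big breasts"]


def blocked_terms(text, blocklist=BLOCKLIST, whole_words=True):
    """Return the blocklist terms found in text, ignoring case."""
    hits = []
    for term in blocklist:
        pattern = re.escape(term)
        if whole_words:
            # Only match the term when it is not part of a longer word.
            pattern = r"\b" + pattern + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(term)
    return hits


def keep_page(text, **kwargs):
    """A page is kept only if it contains no blocklisted term at all."""
    return not blocked_terms(text, **kwargs)


# Made-up sample pages, one per failure mode discussed above.
PAGES = {
    "medical": "Apply the cream to the nipple twice a day after feeding.",
    "form": "Please state your age and sex on the form.",
    "cooking": "Rapeseed oil has a high smoke point.",
    "evasion": "The statue has breasts as big as its head.",
}

if __name__ == "__main__":
    for name, text in PAGES.items():
        print(name,
              "| whole words:", blocked_terms(text),
              "| substrings:", blocked_terms(text, whole_words=False))
```

In this sketch the medical and form pages are discarded in either mode, the cooking page is discarded only under substring matching, and the "as big as" page passes both, which matches the point in [1]: the filter throws away harmless pages while missing trivial rephrasings.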