by Al-Khwarizmi
2428 days ago
This surprised me a bit, regarding how they built the training corpus: "We removed any page that contained any word on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”." I don't understand this decision. The list contains words that can be used in a perfectly neutral sense, like "anus", "bastard", "erotic", "eunuch", "fecal", etc. I can understand wanting to avoid websites that are full of expletives and have no useful content, but excluding every page with even a single occurrence of such a word seems too drastic. If we asked this model a reading comprehension question about a legitimized bastard who inherited the throne, or about fecal transplants, I suppose it would easily fail. A strange way of limiting such a powerful model.