Hacker News
by Al-Khwarizmi 2428 days ago
This surprised me a bit, on the creation of the corpus they use for training:

"We removed any page that contained any word on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”."

I don't understand this decision. This list contains words that can be used in a perfectly objective sense, like "anus", "bastard", "erotic", "eunuch", "fecal", etc.

I can understand that they want to avoid websites full of expletives and with no useful content, but outright excluding any website with even one occurrence of such words sounds too radical. If we ask this model a text comprehension question about a legitimized bastard who inherited the throne, or about fecal transplants, I suppose it would easily fail. Strange way of limiting such a powerful model.
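
To make the concern concrete, here is a minimal sketch of the kind of page-level filter the quote describes. It assumes whole-word, case-insensitive matching, which may not be how the authors actually matched words. `BLOCKLIST` holds only the five examples mentioned above, not the real list, and the function names and sample pages are made up for illustration.

```python
import re

# Hypothetical stand-in for the blocklist: only the five examples
# named in this post, not the actual list.
BLOCKLIST = {"anus", "bastard", "erotic", "eunuch", "fecal"}

# Assumption: pages are split into lowercase alphabetic tokens.
WORD_RE = re.compile(r"[a-z']+")


def contains_blocked_word(page, blocklist=BLOCKLIST):
    """Return True if any whole word on the page is in the blocklist."""
    words = set(WORD_RE.findall(page.lower()))
    return not words.isdisjoint(blocklist)


def filter_pages(pages, blocklist=BLOCKLIST):
    """Drop every page that contains at least one blocked word."""
    return [p for p in pages if not contains_blocked_word(p, blocklist)]


# Made-up sample pages: the first two are informative but get dropped.
pages = [
    "Fecal transplants are used to treat recurrent gut infections.",
    "William the Conqueror was also known as William the Bastard.",
    "The transformer architecture relies on self-attention.",
]
kept = filter_pages(pages)
```

Under these assumptions, a single occurrence is enough to discard a page, so the medical and historical pages are removed along with anything actually objectionable.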

1 comment

They say they removed pages, not websites. Having false positives isn't a problem when you're still left with 750 GB of data; quality matters more than slightly higher quantity at that point.

Sorry, I was thinking about pages even though I said websites. Native language interference (typically, we use the same term for pages and websites in my language).

Anyway, my point is not a matter of quantity. The way they're doing it, they have 750 GB of data, but they have exactly zero data that talks about bastards, fecal transplants, etc. So they may have a hard time answering questions about those specific subjects.