Hacker News
by rhdunn 1177 days ago
That will also remove:

1. medical pages/docs using the medical terms anus, rectum, nipple, and semen (note that other medical terms are not on that list).

2. pages/docs using "sex" to refer to males and females.

3. pages/docs talking about rapeseed oil or the plant it comes from (https://en.wikipedia.org/wiki/Rapeseed_oil).

The big problem with these lists is that they exclude valid contexts and include only a small set of possible terms, so the model would get a distorted view of the world (for example, learning that people can have penises, vaginas, and breasts, but not nipples or anuses, and that breasts cannot be big [1]). It would be better to train the models on these terms, teach them the contexts, and teach them which usages are archaic, outdated, or old-fashioned.

[1] But this misses the cases where "as big as", etc. separate the adjective from the noun, so just excluding the term "big breasts" is ineffective.
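
To make both failure modes concrete, here is a minimal sketch of such a filter. It assumes the filter drops any document containing a listed term as a case-insensitive substring; the real pipeline may match differently. `BLOCKLIST` is a hypothetical subset built only from terms mentioned in this comment, and the sample sentences are made up for illustration.

```python
# Hypothetical subset of a blocklist, using only terms mentioned in this thread.
BLOCKLIST = ["anus", "rectum", "nipple", "semen", "sex", "rape", "big breasts"]


def is_blocked(text: str) -> bool:
    """Return True if any blocklist term occurs as a case-insensitive substring."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


# Made-up sample documents: three legitimate ones and one rephrased one.
docs = [
    "Semen analysis is a standard part of a fertility workup.",  # medical
    "The survey recorded the age and sex of each participant.",  # demographic
    "Rapeseed oil has a high smoke point.",                      # cooking
    "The sculpture's breasts are as big as its head.",           # "as big as" phrasing
]

for doc in docs:
    print(is_blocked(doc), "-", doc)
```

Under these assumptions the three legitimate documents are dropped, while the last one passes because the adjective and noun are no longer adjacent.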


This is what's known as the Scunthorpe problem. https://en.wikipedia.org/wiki/Scunthorpe_problem

I was thinking of that, but I think that while it's in the same vein, there's also an additional problem.

Apart from that list missing non-English words, leetspeak, and emoji, there are also plenty of words which can be innocent or dirty depending entirely on context: that list doesn't have "prick", presumably because someone read about why you're allowed to "prick your finger" but not vice versa.

Regarding Scunthorpe, looking at that word list:

> taste my

It's probably going to block cooking blogs and recipe collections.
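
A rough sketch of why this is harder to fix than the classic Scunthorpe case: matching on word boundaries stops a term from firing inside a longer word, but a listed phrase still matches innocent sentences. The two-term list and the sample sentences below are illustrative assumptions, not the actual word list or its matching rules.

```python
import re

# Hypothetical two-term list: "taste my" is quoted above; "rape" is assumed
# from the rapeseed example earlier in the thread.
TERMS = ["rape", "taste my"]

# Whole-word matcher: \b requires a word boundary on both sides of the term.
WORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in TERMS) + r")\b",
    re.IGNORECASE,
)


def blocked_substring(text: str) -> bool:
    """Naive filter: a term anywhere in the text, even inside another word."""
    lowered = text.lower()
    return any(term in lowered for term in TERMS)


def blocked_whole_word(text: str) -> bool:
    """Stricter filter: a term only counts when it stands as whole words."""
    return WORD_PATTERN.search(text) is not None


oil = "Rapeseed oil has a high smoke point."
recipe = "Come over and taste my grandmother's apple pie."

print(blocked_substring(oil), blocked_whole_word(oil))
print(blocked_substring(recipe), blocked_whole_word(recipe))
```

With these inputs, word boundaries rescue the rapeseed sentence but the recipe sentence is blocked either way, since the phrase really is there and only context says it is harmless.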