|
That will also remove: 1. medical pages/docs using the medical terms anus, rectum, nipple, and semen (note that other medical terms are not on that list). 2. pages/docs using "sex" to refer to males and females. 3. pages/docs talking about rapeseed oil or the plant it comes from (https://en.wikipedia.org/wiki/Rapeseed_oil). The big problem with these lists is that they exclude valid contexts, and only include a small set of possible terms, so the model would get a distorted view of the world (like it learning that people can have penises, vaginas, breasts, but not nipples or anuses, and breasts cannot be big [1]). It would be better to train the models on these, teach it the contexts, and teach it where various usages are archaic, out dated, old fashioned, etc. [1] but this is excluding the cases where "as big as", etc. are used to join the noun from the adjective, so just excluding the term "big breasts" is ineffective. |