by devonkim 2769 days ago
One thing I did for a while was take content from porn sites, extract key frames and any available images, and look for SFW images. The hard problems include detecting NSFW content that shows no nudity or genitalia (bodily fluids and solids) and correcting for skin tone: most models were trained primarily on data from Caucasian and Asian performers and had trouble with darker skin tones. Some previous research showed that trained CNNs were keying heavily on lipstick, so adding samples from performers with less contrast on the lips was also important for training. I didn’t notice anything terribly different when training with transgender performers (hotdog / not hotdog is very easy from an object detection standpoint), but I had to be sure that a confusion a human might have wouldn’t introduce bias into the model. Another big plus with porn sites is that the data is already tagged by users, and those tags are checked aggressively for accuracy.
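
To make the skin tone point concrete, here is a rough sketch of how you might check whether a classifier misses more NSFW images for one group than another. This is not what I ran; the group labels and the record layout are made up for illustration, and it assumes you already have ground truth and predictions per image.

```python
from collections import defaultdict


def recall_by_group(records):
    """Compute NSFW recall separately for each group.

    records: iterable of (group, true_label, predicted_label) tuples,
    where the labels are booleans and True means NSFW. The group names
    are hypothetical annotations, not something the source data provides.
    """
    hits = defaultdict(int)       # NSFW images correctly flagged, per group
    positives = defaultdict(int)  # all truly NSFW images, per group
    for group, truth, pred in records:
        if truth:
            positives[group] += 1
            if pred:
                hits[group] += 1
    # Only groups with at least one truly NSFW image appear in the result.
    return {group: hits[group] / positives[group] for group in positives}
```

A large gap in recall between groups is a hint that the training set underrepresents one of them.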

My point is really that image searches only get you so far, and that biases abound in casual NSFW searches, to the extent that you may need to curate your own data sets that look like they could come from ANY section of a random porn site. Finding an appropriate training set almost reminds me of jury selection.
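
As a minimal sketch of the "ANY section" idea, assuming each image already carries a section tag from the site: draw the same number of samples from every section so that no single category dominates the training set. The function name, the `(item_id, section)` layout, and the `per_section` parameter are my own placeholders, not part of any real pipeline.

```python
import random
from collections import defaultdict


def sample_evenly_by_section(items, per_section, seed=0):
    """Pick up to per_section item ids from every section.

    items: iterable of (item_id, section) pairs (hypothetical layout).
    Returns a dict mapping each section to a list of sampled ids.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_section = defaultdict(list)
    for item_id, section in items:
        by_section[section].append(item_id)

    sample = {}
    for section, ids in sorted(by_section.items()):
        # Small sections contribute everything they have.
        k = min(per_section, len(ids))
        sample[section] = rng.sample(ids, k)
    return sample
```

Sections with fewer than `per_section` items are still underrepresented after this, which is where the manual curation comes in.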