by zawerf 2769 days ago
If the NSFW images were from Google search (or are images that could easily be indexed by Google), it's not surprising that Google is performing well on them. Presumably any low-confidence images would've been manually tagged and re-added to their training set already.

Hey zawerf, you're right. Two notes.

1) It's very difficult to find NSFW images, especially a particular kind like gore or suggestive nudity, unless you Google things (which indicates a bigger problem). Maybe the solution is to use Bing (though that might cause the same issue in the comparison) or DuckDuckGo. But honestly, if DuckDuckGo indexed a page, I'm pretty sure Google did as well. You would probably need images from a non-indexed website, which makes the job significantly harder.

2) Even though Google has all the images, it's still not the best-performing NSFW detector; Nanonets is.

If you're struggling on (1), Reddit has all kinds of bizarre subreddits which cater to all kinds of images. It's also conveniently (is that fortunate?) well categorised. There are definitely subs for gory photos, non-nudity NSFW images and so on. Reddit is also a great resource for categorised SFW images, since there are so many subreddits with active and strict moderation.

It's reasonably easy to scrape: https://github.com/jveitchmichaelis/redditdownloader

They may have changed the API since I wrote that though.
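
For anyone who wants to try the subreddit-as-label idea without that repo, here is a rough sketch (not the code from the linked project). It assumes Reddit's public listing JSON is still reachable at /r/<subreddit>/top.json and still has the data.children[].data.url and over_18 fields, which may no longer hold given the API changes mentioned above. The function names, the User-Agent string and the label values are made up for illustration.

```python
import json
import urllib.request
from urllib.parse import urlparse

# File extensions treated as direct image links (an assumption; adjust as needed).
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def listing_url(subreddit, limit=100):
    """Build the URL of a subreddit's top-posts listing (assumed public JSON endpoint)."""
    return f"https://www.reddit.com/r/{subreddit}/top.json?limit={limit}&t=all"


def extract_image_posts(listing, label):
    """Turn a parsed listing into rows of {url, label, over_18}.

    `label` is whatever category you assign to the subreddit, e.g. "gore" or "sfw".
    Posts that do not link directly to an image file are skipped.
    """
    rows = []
    for child in listing.get("data", {}).get("children", []):
        post = child.get("data", {})
        url = post.get("url", "")
        if urlparse(url).path.lower().endswith(IMAGE_EXTENSIONS):
            rows.append({
                "url": url,
                "label": label,
                "over_18": bool(post.get("over_18", False)),
            })
    return rows


def fetch_listing(subreddit, limit=100):
    """Download and parse one listing page. Needs network access; may be rate limited."""
    request = urllib.request.Request(
        listing_url(subreddit, limit),
        headers={"User-Agent": "dataset-sketch/0.1"},  # hypothetical client name
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

You would call something like extract_image_posts(fetch_listing("some_subreddit"), "some_label") once per subreddit, then download the URLs separately. Keeping the parsing apart from the fetching means the parsing can be checked against a saved listing without hitting the site.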

So one thing I did for a while was to take content from porn sites, extract key frames and any available images, and look for SFW images. Hard problems include detecting NSFW stuff that includes no nudity / genitalia (bodily fluids and solids) and correcting skin tone (most models being trained on primarily Caucasian and Asian performer data had trouble with darker skin tones). Some previous research showed that the trained CNNs were looking very hard for lipstick, so adding in samples from performers with less contrast on the lips was also important for training purposes. I didn’t notice anything terribly different when training with transgender performers (hotdog / not hot dog is very easy from an object detection basis) but I had to be sure that there wasn’t confusion that a human could have that would introduce bias into the model. Another big plus with porn sites is that your data is already tagged by its users and they are checked aggressively for accuracy.

My point is really that image searches can only get you so far, and that biases abound in casual NSFW searches, to the extent that you may need to curate your own data sets that look like they could be on a random porn site in ANY section. Finding an appropriate training set almost reminds me of the jury selection process.
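
One simple way to approximate the "could be from ANY section" idea is to sample the same number of items from every category instead of taking whatever a search returns. This is only a sketch of that one step, not what I actually ran: the "category" key and the balanced_sample name are assumptions for illustration, and it does nothing about biases inside a category (skin tone, lipstick and so on).

```python
import random
from collections import defaultdict


def balanced_sample(items, per_category, seed=0):
    """Pick up to `per_category` items from each category, at random.

    `items` is a list of dicts that each carry a "category" key (hypothetical schema).
    Categories with fewer than `per_category` items contribute everything they have.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_category = defaultdict(list)
    for item in items:
        by_category[item["category"]].append(item)

    sample = []
    for category in sorted(by_category):  # sorted for a stable output order
        pool = by_category[category]
        count = min(per_category, len(pool))
        sample.extend(rng.sample(pool, count))
    return sample
```

Small categories are the ones to watch here: if a section only has a handful of items, the sample will still be lopsided and you probably need to go find more data for it.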