|
You're right that stuff is quite difficult. I write a Firefox addon (https://addons.mozilla.org/en-US/firefox/addon/wingman-jr-fi..., https://github.com/wingman-jr-addon/wingman_jr) and train an associated NSFW model (https://github.com/wingman-jr-addon/model) - I've been at it for a few years now, and have had to plug many specific edge cases:
- Babies (https://github.com/wingman-jr-addon/wingman_jr/issues/22)
- Beach volleyball (but this definitely has SFW and NSFW variants, based on a somewhat subjective line)
- Athletes in general. The model particularly thought some American football players were NSFW for a long time.
- Swimming
- Yoga - again, most SFW and some NSFW here but it still struggles
- Wrestling was a tough one for sure
- Pokemon
While indeed tough, I've seen definite progress. So it's not just a matter of tech, but also of considering the human element: the state of the art may not be up to the challenge of perfection, but it is definitely up to a point of true utility for some use cases. I'm happy about that.
As a note, it uses an EfficientNet Lite L0 backbone, since I'm a bit limited in what type of scanning I can perform in a sufficiently speedy manner.
I also agree on the context for sure. One reason I haven't tried switching to an object detection method (and that I don't rely heavily on truly random crops) is that the focus of the image is highly important for the NSFW-ness in some cases. True, two images may contain the same content ... but one is far worse than the other. The nature of CNNs still has some of this location-invariance baked in, but I don't want to exacerbate it.
One challenge I think the OP may run into here, and one that may not be immediately obvious, is that accuracy on image stills does not translate that well to video. I have basic video support in my addon, and while I knew there would be some differences, I was surprised at how many discrepancies there really are. Two examples:
- Images in video are often blurrier. In true still images, blurriness is somewhat more common in amateur NSFW content, so a model trained on stills can pick up a prior that blurry means NSFW. On video, that becomes a source of false positives.
- The opposite of the note above about focus. Taking stills of moving images will have many transitory frames that seem inappropriate on their own because it seems as if they are focusing on something when in reality the camera is just panning - obvious to the human, less so to the model trained on stills.
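One common way to soften the transitory-frame problem is to smooth per-frame scores over time before acting on them. The following is only a minimal sketch of that idea, not how the addon actually handles video: `smooth_scores`, `flag_frames`, the window size, and the 0.8 threshold are all hypothetical, and the scores are assumed to be per-frame NSFW probabilities in [0, 1] from a classifier trained on stills.

```python
from collections import deque
from statistics import median


def smooth_scores(frame_scores, window=5):
    """Rolling median of per-frame scores over the last `window` frames.

    Assumption: `frame_scores` is an iterable of floats in [0, 1], one per
    sampled video frame, in playback order.
    """
    recent = deque(maxlen=window)  # keeps only the most recent scores
    smoothed = []
    for score in frame_scores:
        recent.append(score)
        smoothed.append(median(recent))
    return smoothed


def flag_frames(frame_scores, threshold=0.8, window=5):
    """Flag a frame only when the smoothed score crosses the threshold."""
    return [s >= threshold for s in smooth_scores(frame_scores, window)]


if __name__ == "__main__":
    # A single spiky frame (e.g. mid-pan) is suppressed...
    print(flag_frames([0.1, 0.1, 0.95, 0.1, 0.1], window=3))
    # ...while a sustained run of high scores still gets flagged.
    print(flag_frames([0.1, 0.9, 0.9, 0.9, 0.9], window=3))
```

A median is used here rather than a mean so that one extreme frame cannot drag the whole window over the threshold; the cost is a delay of a frame or two before sustained content is flagged.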
At any rate, given how well your list of edge cases coincided with failures I've grappled with, I'd be interested to see how well you think my addon stacks up for still images when set to stay in "normal" mode. I'd love to hear any feedback you have via GitHub so I can make it better. |
One of the technical issues you pointed out is that a model trained on still images shouldn't be expected to work on video. While I did not train a custom model for this project, I'm currently working on another DNN model for a completely different purpose, where I think feeding frame deltas into the model will improve the outcome.
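To make "frame deltas" concrete, here is a rough sketch of the simplest version: subtract each frame from the one after it, pixel by pixel. It is an illustration under stated assumptions, not the pipeline from either project. Frames are assumed to be equally sized grayscale images stored as nested lists of numbers (in practice they would more likely be arrays), and `frame_deltas` and `motion_energy` are hypothetical helper names.

```python
def frame_deltas(frames):
    """Return per-pixel differences between each frame and its predecessor.

    Assumption: `frames` is a list of equally sized 2D grids (lists of rows)
    of grayscale intensities. N frames produce N - 1 delta frames.
    """
    deltas = []
    for prev, curr in zip(frames, frames[1:]):
        deltas.append([
            [c - p for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)
        ])
    return deltas


def motion_energy(delta):
    """Mean absolute change in one delta frame: a crude scalar for motion."""
    values = [abs(v) for row in delta for v in row]
    return sum(values) / len(values)


if __name__ == "__main__":
    frames = [
        [[0, 0], [0, 0]],
        [[0, 10], [0, 0]],   # one pixel brightens
        [[0, 10], [0, 0]],   # nothing changes
    ]
    for delta in frame_deltas(frames):
        print(delta, motion_energy(delta))
```

The delta frames could be stacked alongside the raw frames as extra input channels, or summarised with something like `motion_energy` to tell a static shot from a pan; which of those helps is an empirical question for the model in question.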
As a hobbyist, I would reckon for porn and the like, analysing frozen frames is probably just enough. For violence however, I would agree with you and say that some effort to encode motion would be essential.
Focussing on NSFW content generally, I would guess, depending on the scale of your project, that you will forever run into 'edge cases' for NSFW images, even before you run into the soft wall of subjectivity.
I agree that the tech is improving all the time, and I think something like this can be made to be truly useful one day. Possibly soon. But it would need a large, active development team, a great deal more compute, and a LOT of data. In much the same way that no home/garage coder can hope to put together a model like GPT3 right now, I would think that a foolproof NSFW classifier would need more resources than you or I have access to at this moment.
But things change all the time.
Thinking about what you're doing, one thing I might suggest, if you have time to develop it, is to add some kind of 'recording' mechanism to your plugin, so that the users themselves can add to your dataset... But you have to wonder how many users will allow that! XD
I'm also wondering if a Firefox extension is the best place for your model? To that end, I would suggest putting the app on a server (which is what I originally wanted to do with my hack) which will give you the opportunity to crowdsource data collection. People might be more willing to volunteer data in that way (in a similar way to how people use https://builtwith.com/).
You're also very welcome to take the UX work I've done on this open-source project (because this hack was ultimately just a UX experiment) and plug your model in. If your model and trained weights are available, I'd like to try to create a branch myself, if I have time.
Also, as a hobbyist still building knowledge, I hadn't heard of `EfficientNet Lite` before. I'd been considering Darknet (https://pjreddie.com/darknet/) for embedded stuff until reading your post.