by jononor 1907 days ago
In the typical ML task formulation, closed-set classification, the model is forced to choose between the provided categories, even if the correct answer is not really any of them. And neural networks tend to behave erratically outside the training data distribution.
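
To make the "forced choice" concrete, here is a minimal sketch of a closed-set output layer. The label set, the logit values and the function names are made up for illustration; the point is only that softmax plus argmax always returns one of the known classes, often with high confidence, no matter what the input was.

```python
import math

CLASSES = ["nudity", "not nudity"]  # hypothetical label set


def softmax(logits):
    """Turn raw scores into probabilities that always sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def closed_set_predict(logits):
    """Pick the most probable known class; there is no 'none of the above'."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]


# Made-up logits, standing in for what a network might emit on an input
# unlike anything in its training data.
label, confidence = closed_set_predict([4.0, -1.0])
print(label, round(confidence, 3))
```

The probabilities are normalized over the known classes only, so a confident-looking score here says nothing about whether the input resembled the training data.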

Adding common inputs to the training (or at least validation and test) sets is a good solution. It's hard data work, but it will pay off. There are also some techniques outside of closed-set classification that can help reduce the problems, or make the process of improving the model more effective:

- Couple the classifier with an out-of-distribution (novelty/anomaly) detector. Samples that score high are considered "Unknown" and can be flagged for review.
- Learn a distance metric for "nudity" instead of a classifier, potentially with unsupervised or self-supervised learning (no labels needed). This has a higher chance of doing well on novel examples, but it still needs to be validated/monitored.
- Use a one-class classifier, trained only on positive samples of nudity. This has the disadvantage that novel nudity is very likely to be classified as "not nudity", which could be an issue.
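
The first option can be sketched roughly as below. This is only one way to do it, under stated assumptions: the novelty score is the distance to the nearest training sample in some feature space, the threshold is a quantile of leave-one-out distances on the training set, and `GuardedClassifier`, `toy_classifier` and the 2D points are hypothetical stand-ins for a real model and real embeddings.

```python
import math


def nearest_distance(x, points):
    """Novelty score: Euclidean distance from x to its closest point."""
    return min(math.dist(x, p) for p in points)


class GuardedClassifier:
    """Wraps a closed-set classifier with a distance-based novelty check."""

    def __init__(self, classify, train_points, quantile=0.95):
        self.classify = classify
        self.train_points = list(train_points)
        # Leave-one-out nearest-neighbour distances describe how far apart
        # "normal" samples are; the chosen quantile becomes the threshold.
        loo = sorted(
            nearest_distance(p, self.train_points[:i] + self.train_points[i + 1:])
            for i, p in enumerate(self.train_points)
        )
        self.threshold = loo[min(len(loo) - 1, int(quantile * len(loo)))]

    def predict(self, x):
        score = nearest_distance(x, self.train_points)
        if score > self.threshold:
            return "Unknown", score  # flag for review instead of guessing
        return self.classify(x), score


def toy_classifier(x):
    """Stand-in for a trained closed-set model (assumption, not a real one)."""
    return "nudity" if x[0] + x[1] > 5 else "not nudity"


# Made-up 2D "embeddings": one cluster per class.
train = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5),
         (5.0, 5.0), (5.5, 5.0), (5.0, 5.5), (5.5, 5.5)]

model = GuardedClassifier(toy_classifier, train)
for sample in [(0.2, 0.3), (5.2, 5.1), (20.0, -3.0)]:
    print(sample, model.predict(sample))
```

The last sample is far from everything seen in training. The bare classifier would still label it "nudity", while the wrapper returns "Unknown" so it can be reviewed and, if it turns out to be a common kind of input, added to the dataset. In practice the score would be computed on learned features rather than raw inputs, and the threshold tuned on validation data.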