by Imnimo
1971 days ago
I guess the way I look at it is that if you can automatically label your training set, you either have solved the problem you set out to learn (just use your labeler as your classifier/detector/whatever), or you're exploiting some limitation of the training set. Given a human-annotated test set, I'd want to see a comparison between three outputs:

- The outputs of the auto-labeler. If this is strong, you've learned that you didn't need the training set after all - you managed to solve the problem without it!
- The outputs of a model trained on auto-labeled data. If this is strong but the above test was not, then this pipeline makes sense.
- The outputs of a model trained on human-labeled data. If this is strong but the above tests were not, we're in trouble.

If none of the three are strong, then the training data was lacking (assuming we've done our best on tuning the model we're trying to train), and so no real value was gained by annotating it.
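To make the three-way comparison concrete, here's a rough sketch in Python. Everything in it is an assumption for illustration: the function names (`accuracy`, `diagnose`), the use of plain accuracy as the metric, and the 0.9 cutoff for "strong" are all hypothetical, and a real task would likely need a task-specific metric and threshold.

```python
from typing import Dict, Sequence, Tuple


def accuracy(preds: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of predictions that match the human-annotated test labels."""
    if len(gold) == 0:
        raise ValueError("the test set is empty")
    if len(preds) != len(gold):
        raise ValueError("predictions and gold labels differ in length")
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def diagnose(
    gold: Sequence[int],
    labeler_preds: Sequence[int],
    auto_model_preds: Sequence[int],
    human_model_preds: Sequence[int],
    strong: float = 0.9,  # hypothetical cutoff for "strong"
) -> Tuple[Dict[str, float], str]:
    """Score the three outputs on the same test set and pick a verdict."""
    scores = {
        "auto_labeler": accuracy(labeler_preds, gold),
        "model_on_auto_labels": accuracy(auto_model_preds, gold),
        "model_on_human_labels": accuracy(human_model_preds, gold),
    }
    # The checks run in the same order as the three cases in the comment.
    if scores["auto_labeler"] >= strong:
        verdict = "labeler already solves the task"
    elif scores["model_on_auto_labels"] >= strong:
        verdict = "auto-label pipeline makes sense"
    elif scores["model_on_human_labels"] >= strong:
        verdict = "auto labels are the weak link"
    else:
        verdict = "training data is lacking"
    return scores, verdict
```

You'd call it as `scores, verdict = diagnose(gold, labeler_preds, auto_model_preds, human_model_preds)`, with all four sequences aligned to the same human-annotated test examples.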
|
When you are labelling data, you have access to strategies and means that might not be available to your downstream model. In our experience this includes a human-in-the-loop component, building non-robust ensemble models (we call these micro-models), and some "guesswork" functions on the data. All of this together can make an "auto labeller" that does pretty well at getting labels made, but really the sum of these strategies is very different from some well-trained neural network that will be running on the edge or whatever.
The point of a model is not to label the data; it's to generate some value in some out-of-sample task, which is quite different from strategies that you can run in a sandboxed environment with your training data.