by elandau25
1971 days ago
I see it a bit differently. I see it as two separate (but correlated) tasks: labelling the data and building a robust model. There is a nuanced gap between the two, because the labelling task and the modelling task live in different constraint spaces. When you are labelling data, you have access to strategies and means that might not be available to your downstream model. In our experience this includes a human-in-the-loop component, non-robust ensemble models (we call these micro-models), and some "guesswork" functions on the data. All of this together can make an "auto labeller" that does pretty well at producing labels, but the sum of these strategies is very different from some well-trained neural network that will be running on edge or whatever. The point of a model is not to label the data; it's to generate value on some out-of-sample task, which is quite different from strategies that you can run in a sandboxed environment with your training data.
|
Obviously we should expect the auto-labeler to fail on the test set, because we assume it's exploiting some convenience that won't be available at test time. But we should still try: it might reveal that our task is too easy to need the model we were planning to train, or that our test set is not actually representative.
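
For what it's worth, that check could look something like the sketch below. Everything here is hypothetical: the keyword heuristics, the label names, and the `auto_label` / `evaluate` helpers are made up for illustration and stand in for whatever mix of micro-models and guesswork functions the real auto-labeller uses. It just takes a majority vote over cheap labellers that may abstain, then reports coverage and accuracy on a held-out set.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple

# A labeller maps an example to a label, or returns None to abstain.
Labeller = Callable[[dict], Optional[str]]


def refund_heuristic(example: dict) -> Optional[str]:
    # Hypothetical "guesswork" function: a keyword match, nothing more.
    return "billing" if "refund" in example["text"] else None


def crash_heuristic(example: dict) -> Optional[str]:
    # Another hypothetical heuristic standing in for a micro-model.
    return "bug" if "crash" in example["text"] else None


def auto_label(example: dict, labellers: Sequence[Labeller]) -> Optional[str]:
    """Majority vote over the labellers that did not abstain."""
    votes = [v for v in (f(example) for f in labellers) if v is not None]
    if not votes:
        return None  # nothing fired; in practice this goes to a human
    return Counter(votes).most_common(1)[0][0]


def evaluate(
    examples: Sequence[dict],
    gold: Sequence[str],
    labellers: Sequence[Labeller],
) -> Tuple[float, float]:
    """Return (coverage, accuracy on the covered examples)."""
    preds = [auto_label(x, labellers) for x in examples]
    covered = [(p, g) for p, g in zip(preds, gold) if p is not None]
    coverage = len(covered) / len(gold) if gold else 0.0
    accuracy = sum(p == g for p, g in covered) / len(covered) if covered else 0.0
    return coverage, accuracy


if __name__ == "__main__":
    test_examples = [
        {"text": "refund please"},
        {"text": "app crash on start"},
        {"text": "hello"},
    ]
    test_gold = ["billing", "bug", "other"]
    coverage, accuracy = evaluate(
        test_examples, test_gold, [refund_heuristic, crash_heuristic]
    )
    print(f"coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

If coverage and accuracy on the test set come out high, that is a hint the task may be too easy to need the planned model, or that the test set shares whatever convenience the heuristics exploit.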