by elandau25 1971 days ago
Hi Imnimo, I wrote the article and definitely understand your concerns. The point is not that the specific steps I took will work in general for most datasets, but rather the overall idea of taking a more data science-y approach to labelling instead of blindly throwing your data at a workforce.

A more varied dataset will require additional strategies. We have done this type of thing with various datasets and what normally works is a combination of some vertical models, heuristics specific to the dataset, classical computer vision techniques, and some human label seeding/correction.


I guess the way I look at it is that if you can automatically label your training set, you either have solved the problem you set out to learn (just use your labeler as your classifier/detector/whatever), or you're exploiting some limitation of the training set. Given a human-annotated test set, I'd want to see a comparison between three outputs:

-The outputs of the auto-labeler. If this is strong, you've learned that you didn't need the training set after all - you managed to solve the problem without it!

-The outputs of a model trained on auto-labeled data. If this is strong but the above test was not, then this pipeline makes sense.

-The outputs of a model trained on human-labeled data. If this is strong but the above tests were not, we're in trouble.

If none of the three are strong, then the training data was lacking (assuming we've done our best on tuning the model we're trying to train), and so no real value was gained by annotating it.
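
A rough sketch of that three-way comparison, assuming a simple classification task scored by accuracy. The function names, the 0.9 "strong" threshold, and the toy predictions are all made up for illustration; a real evaluation would use whatever metric fits the task.

```python
from typing import Dict, Sequence


def accuracy(preds: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of predictions that match the human-annotated test labels."""
    if len(truth) == 0 or len(preds) != len(truth):
        raise ValueError("preds and truth must be non-empty and the same length")
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)


def three_way_verdict(scores: Dict[str, float], strong: float = 0.9) -> str:
    """Map the three test-set scores onto the cases listed above.

    `strong` is an arbitrary, hypothetical cutoff for a "strong" result.
    """
    if scores["auto_labeler"] >= strong:
        return "labeler already solves the task; no training set needed"
    if scores["trained_on_auto"] >= strong:
        return "auto-labeling pipeline makes sense"
    if scores["trained_on_human"] >= strong:
        return "auto-labels are the weak link"
    return "training data is lacking"


# Toy human-annotated test set and three sets of predictions (invented data).
truth = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
auto_labeler_preds = [1, 1, 0, 1, 0, 0, 0, 1, 1, 1]
trained_on_auto_preds = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
trained_on_human_preds = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]

scores = {
    "auto_labeler": accuracy(auto_labeler_preds, truth),
    "trained_on_auto": accuracy(trained_on_auto_preds, truth),
    "trained_on_human": accuracy(trained_on_human_preds, truth),
}
verdict = three_way_verdict(scores)
```

With these invented numbers the labeler itself is weak on the test set but the model trained on its labels is not, which is the case where the pipeline earns its keep.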

I see it a bit differently. I see it as two separate (but correlated) tasks: labelling the data and building a robust model. There is a nuanced gap between the two, because the labelling task and the model task live in different constraint spaces.

When you are labelling data, you have access to strategies and means that might not be available to your downstream model. In our experience this includes a human-in-the-loop component, building non-robust ensemble models (we call these micro-models), and some "guess work" functions on the data. All of this together can make an "auto labeller" that does pretty well at getting labels made, but the sum of these strategies is very different from a well-trained neural network that will be running on the edge or wherever.
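
To make that concrete, here is a minimal sketch of how such pieces might be combined, not our actual pipeline. Each heuristic or micro-model is a function that returns a label or abstains, the votes are pooled, and anything without enough agreement is routed to a human. The names (`auto_label`, `min_agreement`, the toy labelers and feature keys) are all hypothetical.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple

# A labeler looks at one item and returns a label, or None to abstain.
Labeler = Callable[[dict], Optional[str]]


def auto_label(
    item: dict, labelers: Sequence[Labeler], min_agreement: float = 0.6
) -> Tuple[Optional[str], str]:
    """Pool votes from cheap, non-robust labelers; defer to a human when unsure."""
    votes = [v for v in (labeler(item) for labeler in labelers) if v is not None]
    if not votes:
        return None, "needs_human"  # every labeler abstained
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label, "auto"
    return None, "needs_human"  # labelers disagree too much


# Toy labelers over made-up features (assumptions for illustration only).
def brightness_heuristic(item: dict) -> Optional[str]:
    if "brightness" not in item:
        return None
    return "day" if item["brightness"] > 0.5 else "night"


def timestamp_heuristic(item: dict) -> Optional[str]:
    if "hour" not in item:
        return None
    return "day" if 7 <= item["hour"] < 19 else "night"


def micro_model(item: dict) -> Optional[str]:
    # Stand-in for a small model overfit to a narrow slice of the data.
    return item.get("micro_model_guess")


labelers = [brightness_heuristic, timestamp_heuristic, micro_model]
```

None of these labelers would survive deployment on their own, which is the point: they only need to be good enough, together and with a human backstop, to produce training labels.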

The point of a model is not to label the data; it's to generate value on some out-of-sample task, which is quite different from strategies that you can run in a sandboxed environment with your training data.

Sure, but you need to demonstrate that the auto-labeled training data is valuable by showing that a model trained on it performs as well (or close to as well) as the same model trained on human-labeled data. Without that, we're just eyeballing the auto-labels and saying "looks good I guess!"

Obviously we should expect that the auto-labeler fails on the test set, because we assume we're exploiting some convenience that won't be available at test time. But we should still try - it might reveal that our task is too easy to need the model we were planning to train, or it might reveal that our test set is not actually representative.

Yeah, so that's more a comment on the accuracy of the auto-generated labels: this approach doesn't assume a different representative set of data than you'd have with human-labelled data, just that less of the data is human labelled.

So it comes down to how good the auto-generated labels are (from a human perspective), which is a fair point that I didn't address much in the article. In general that comes down to a good QA process (applied equally to human labels and machine labels, because humans also make mistakes in this stuff).

In the article the dataset was small enough and the labels simple enough that I could run a very quick visual inspection over the results, but for more complicated tasks we have a more rigorous human review process for evaluating label accuracy (again, for both human- and algorithm-produced labels). The auto-generated labels may not be more efficient overall if they require a lot of correction after review, but for this case, and a lot of other ones, they empirically are.
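
As a back-of-the-envelope illustration of that efficiency trade-off, here is a small sketch. It assumes every auto-generated label gets reviewed and only the wrong ones get corrected; the per-item timings and the 10% error rate are invented numbers, not measurements from the article.

```python
def manual_cost(n_items: int, label_s: float) -> float:
    """Total seconds to label everything by hand."""
    return n_items * label_s


def auto_label_cost(
    n_items: int, review_s: float, fix_s: float, error_rate: float
) -> float:
    """Total seconds to review every auto-label and fix the wrong ones."""
    return n_items * (review_s + error_rate * fix_s)


def break_even_error_rate(label_s: float, review_s: float, fix_s: float) -> float:
    """Error rate above which auto-labeling stops saving time."""
    return (label_s - review_s) / fix_s


# Hypothetical timings: 30 s to label from scratch, 5 s to review, 40 s to fix.
by_hand = manual_cost(1000, label_s=30)
with_auto = auto_label_cost(1000, review_s=5, fix_s=40, error_rate=0.1)
threshold = break_even_error_rate(label_s=30, review_s=5, fix_s=40)
```

Under these assumed timings, auto-labeling stays cheaper as long as fewer than about 62% of the labels need correction; with slower review or faster manual labelling the threshold drops quickly.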