by ajratner 2164 days ago
Lurking for a few more min... great question! Dealing with both class imbalance and issues of pernicious biases in both underlying data distributions and training labels is an extremely important topic. Our underlying theory deals with local biases (e.g. individual labeling functions or sources of training signal being biased) but systemic biases (e.g. the user driving the system being biased) are certainly tougher.

One important and practical answer that we've found: with an approach like the one in Snorkel Flow, you can inspect the source of the training data and correct it if it's biased, which you just can't do with e.g. a million hand-labeled training data points. In practice this has been a big advantage.

On the theory / research side, this is definitely an area we want to pursue further!


Thanks for the answer! The local vs systemic bias thing is particularly interesting for a system like Snorkel. I have a clarifying question.

I'm imagining a dumb example like recipes, where "1 tsp salt" is a common format for an ingredient. I'd imagine that the majority of ingredients follow that format, so it'd be a natural function to write. I'd also imagine that there's a correlation between following that format and being a recipe with a European background.
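To make the recipe example concrete, here is a rough sketch in plain Python (not Snorkel's API). The regex, the function names, the toy ingredient lines, and the group tags "A" and "B" are all made up for illustration; the point is only that a simple format rule can fire far more often on one group than on another.

```python
import re
from collections import defaultdict

ABSTAIN = -1
INGREDIENT = 1

# Hypothetical rule: "<number> <unit> <name>", e.g. "1 tsp salt".
PATTERN = re.compile(
    r"^\s*\d+(?:[./]\d+)?\s+(?:tsp|tbsp|cups|cup|g|ml)\s+\S.*$",
    re.IGNORECASE,
)

def lf_quantity_unit_name(line):
    """Vote INGREDIENT if the line matches the format, otherwise abstain."""
    return INGREDIENT if PATTERN.match(line) else ABSTAIN

def coverage_by_group(examples):
    """examples: iterable of (text, group) pairs.

    Returns {group: fraction of that group's lines the rule fires on}.
    """
    fired = defaultdict(int)
    total = defaultdict(int)
    for text, group in examples:
        total[group] += 1
        if lf_quantity_unit_name(text) != ABSTAIN:
            fired[group] += 1
    return {group: fired[group] / total[group] for group in total}

# Toy data, invented for this sketch; "A" and "B" are arbitrary group tags.
examples = [
    ("1 tsp salt", "A"),
    ("2 cups flour", "A"),
    ("1/2 cup sugar", "A"),
    ("a pinch of asafoetida", "B"),
    ("salt to taste", "B"),
    ("200 g rice", "B"),
]

coverage = coverage_by_group(examples)
print(coverage)
```

On this toy data the rule covers every line in group A but only one of three in group B, even though all six lines are ingredients.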

Generalize that a little bit, and almost by definition the simplest N rules that get the most coverage will cover the majority cases best. Being outside the majority cases is probably correlated with most "human issues," defined however you want. Since this is an artifact of what the simplest N rules happen to cover, I'm not clear whether it'd count as local or systemic in the sense you've worked on.

I'm curious whether this falls under the theory you've worked on already or the theory you're talking about pursuing in the future. If it's something you've worked on already, I'd be very interested in reading what you have.

Great question! Let me rephrase so you can confirm my understanding: I have some labeling functions (LFs) that are far more accurate on a majority subset of the data than on one or more minority subsets or "slices" of the data... and these subsets are not necessarily correlated with the class labels, so this isn't a traditional class imbalance problem...
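One way to check for the situation described above is to score each LF separately per slice on a small labeled dev set. The sketch below is plain Python under assumed conventions (an abstain value of -1, slice tags supplied by hand, made-up votes and labels); it is not Snorkel's own analysis tooling.

```python
ABSTAIN = -1

def accuracy_by_slice(lf_votes, gold, slice_ids):
    """Accuracy of one labeling function per slice, ignoring abstains.

    lf_votes, gold, and slice_ids are parallel sequences; slices on which
    the LF never votes are left out of the result.
    """
    stats = {}  # slice -> (number correct, number of non-abstain votes)
    for vote, label, slice_id in zip(lf_votes, gold, slice_ids):
        if vote == ABSTAIN:
            continue
        correct, voted = stats.get(slice_id, (0, 0))
        stats[slice_id] = (correct + int(vote == label), voted + 1)
    return {s: correct / voted for s, (correct, voted) in stats.items()}

# Toy dev set, invented for this sketch. Both classes (0 and 1) appear in
# both slices, so the gap below is not a class imbalance effect.
lf_votes = [1, 1, 0, 1, ABSTAIN, 1, 0, 1]
gold = [1, 1, 0, 1, 0, 0, 1, 1]
slice_ids = ["majority"] * 4 + ["minority"] * 4

per_slice = accuracy_by_slice(lf_votes, gold, slice_ids)
print(per_slice)
```

Here the same LF is right on all four majority examples but on only one of its three minority votes, which is the kind of gap an aggregate accuracy number would hide.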

We've actually done some recent work on this (https://papers.nips.cc/paper/9137-slice-based-learning-a-pro...) where we have users define these critical "slices" approximately so that the model being trained can pay special attention to them (extra representation layers) so they don't get drowned out by the majority subsets/slices. But definitely a lot more to do in this area!
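As a rough illustration of "defining slices approximately", slices can be written as cheap heuristic predicates over examples. The sketch below only computes slice membership in plain Python; the predicate names and toy lines are hypothetical, and it does not implement the extra representation layers from the paper.

```python
def sf_no_leading_quantity(text):
    """Heuristic slice: lines that do not start with a digit."""
    return not text.strip()[:1].isdigit()

def sf_short_line(text):
    """Heuristic slice: lines with at most two whitespace-separated tokens."""
    return len(text.split()) <= 2

# Hypothetical slicing functions; they may overlap and may be noisy.
SLICING_FUNCTIONS = [sf_no_leading_quantity, sf_short_line]

def slice_membership(texts, slicing_functions=SLICING_FUNCTIONS):
    """Return one row per text with a 0/1 indicator per slicing function."""
    return [[int(sf(text)) for sf in slicing_functions] for text in texts]

texts = ["1 tsp salt", "salt to taste", "eggs"]
membership = slice_membership(texts)
for text, row in zip(texts, membership):
    print(text, row)
```

A membership matrix like this is what a downstream model could use to decide which examples deserve the extra attention, and it also makes per-slice evaluation straightforward.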

Cool idea, and thanks for the answer! I'll have to look more closely at the paper :)