by ajratner
2164 days ago
Lurking for a few more min... great question! Dealing with both class imbalance and pernicious biases in the underlying data distributions and training labels is an extremely important topic. Our underlying theory deals with local biases (e.g. individual labeling functions or sources of training signal being biased), but systemic biases (e.g. the user driving the system being biased) are certainly tougher. One important and practical answer we've found: with an approach like the one in Snorkel Flow, you can inspect the source of the training data and correct it if it's biased, which you just can't do with e.g. a million hand-labeled training data points. So in practice this is a big advantage. On the theory / research side, this is definitely an area we want to pursue further!
I'm imagining a dumb example like recipes, where "1 tsp salt" is a common format for an ingredient. I'd imagine that the majority of ingredients follow that format, so it'd be a natural function to write. I'd also imagine that there's a correlation between following that format and being a recipe with a European background.
Generalize that a little bit, and almost by definition the simplest N rules that get the most coverage will cover the majority cases best. Being outside the majority cases is probably correlated with most "human issues," however you want to define them. Since that bias is an artifact of what the simplest N rules happen to cover, I'm not clear whether it'd count as local or systemic in the sense you've worked on.
I'm curious whether this falls under the theory you've worked on already or the theory you're talking about pursuing in the future. If it's something you've worked on already, I'd be very interested in reading what you have.
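To make the recipe example concrete, here's a minimal sketch in plain Python (not Snorkel's API). The regex, the function name `lf_quantity_unit_name`, the groups "A" and "B", and the ingredient lines are all made up for illustration; the point is only that a simple, high-coverage rule can fire far more often on one group than another.

```python
import re
from collections import defaultdict

# Hypothetical pattern for "<quantity> <unit> <name>", e.g. "1 tsp salt".
PATTERN = re.compile(
    r"^\d+(?:[./]\d+)?\s+(?:tsp|tbsp|cup|cups|g|ml|oz)\s+\w+",
    re.IGNORECASE,
)

def lf_quantity_unit_name(line):
    """Toy labeling function: 1 = ingredient, -1 = abstain."""
    return 1 if PATTERN.match(line) else -1

def coverage_by_group(examples):
    """Fraction of lines per group on which the function does not abstain."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, line in examples:
        totals[group] += 1
        if lf_quantity_unit_name(line) != -1:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}

# Made-up data: group "A" mostly uses the majority format, group "B" mostly doesn't.
examples = [
    ("A", "1 tsp salt"),
    ("A", "2 cups flour"),
    ("A", "1/2 cup sugar"),
    ("A", "salt to taste"),
    ("B", "a pinch of salt"),
    ("B", "rice, two handfuls"),
    ("B", "1 tbsp ghee"),
    ("B", "ginger, thumb-sized piece"),
]

coverage = coverage_by_group(examples)
print(coverage)
```

On this toy data the rule covers 3 of 4 lines in group "A" and 1 of 4 in group "B", so whatever gets trained on its output sees much less signal for "B". Breaking coverage down per group like this is one way to spot the kind of skew I'm describing, assuming you have some group attribute to slice on.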