by ianhorn 2164 days ago
Thanks for the answer! The local vs systemic bias thing is particularly interesting for a system like Snorkel. I have a clarifying question.

I'm imagining a dumb example like recipes, where "1 tsp salt" is a common format for an ingredient. I'd imagine that the majority of ingredients follow that format, so it'd be a natural function to write. I'd also imagine that there's a correlation between following that format and being a recipe with a European background.

Generalize that a little bit, and almost by definition the simplest N rules that get the most coverage will cover the majority cases best. Being outside the majority cases is probably correlated with most "human issues," defined however you want. Since that bias is an artifact of what the simplest N rules happen to cover, I'm not clear whether it'd count as local or systemic in the sense you've worked on.
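
To make the worry concrete, here's a toy sketch of what I mean. Everything in it is made up for illustration: the ingredient lines, the "majority"/"minority" tags, and the single regex rule are all hypothetical, and this is plain Python rather than Snorkel's actual API. It just measures how often one simple "quantity unit name" rule fires in each group.

```python
import re
from collections import defaultdict

# Hypothetical toy data: (ingredient line, slice tag).
# The slice tags are invented for illustration only.
INGREDIENTS = [
    ("1 tsp salt", "majority"),
    ("2 cups flour", "majority"),
    ("3 tbsp butter", "majority"),
    ("a pinch of asafoetida", "minority"),
    ("salt to taste", "minority"),
    ("1 tsp cumin", "minority"),
]

# One "simple rule": a number, a unit, then a name.
QTY_UNIT_NAME = re.compile(r"^\d+(\.\d+)?\s+(tsp|tbsp|cups?)\s+\w+")


def rule_fires(text):
    """Return True if the line matches the quantity-unit-name format."""
    return QTY_UNIT_NAME.match(text) is not None


def coverage_by_slice(rows):
    """Fraction of lines in each slice on which the rule fires."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for text, slice_name in rows:
        totals[slice_name] += 1
        hits[slice_name] += rule_fires(text)  # True counts as 1
    return {s: hits[s] / totals[s] for s in totals}


coverage = coverage_by_slice(INGREDIENTS)
print(coverage)
```

On this toy data the rule fires on every "majority" line but only on one of the three "minority" lines, which is the kind of gap I have in mind.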

I'm curious whether this falls under the theory you've worked on already or the theory you're talking about pursuing in the future. If it's something you've worked on already, I'd be very interested in reading what you have.


Great question! Let me rephrase so you can confirm my understanding: I have some labeling functions (LFs) that are far more accurate on a majority subset of the data than on one or more minority subsets or "slices" of the data... and these subsets are not necessarily correlated with the class labels, so this isn't a traditional class imbalance problem...
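
As a rough illustration of that rephrasing (a sketch with invented numbers, not Snorkel code): suppose each row records one LF's vote, the true label, and which slice the example belongs to. The `ABSTAIN` value, the rows, and the slice names below are all assumptions made up for the example. Both classes appear in both slices, so the gap is not a class imbalance.

```python
ABSTAIN = -1  # assumed convention for "the LF did not vote"

# Hypothetical rows: (lf_vote, gold_label, slice_name).
ROWS = [
    (1, 1, "majority"),
    (0, 0, "majority"),
    (1, 1, "majority"),
    (0, 0, "majority"),
    (1, 0, "minority"),
    (0, 0, "minority"),
    (ABSTAIN, 1, "minority"),
    (0, 1, "minority"),
]


def accuracy_by_slice(rows):
    """Accuracy of the LF on each slice, ignoring rows where it abstains."""
    stats = {}
    for vote, gold, slice_name in rows:
        if vote == ABSTAIN:
            continue
        correct, total = stats.get(slice_name, (0, 0))
        stats[slice_name] = (correct + (vote == gold), total + 1)
    return {s: correct / total for s, (correct, total) in stats.items()}


accuracy = accuracy_by_slice(ROWS)
print(accuracy)
```

Here the LF is right on all four majority rows but on only one of its three non-abstaining minority votes.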

We've actually done some recent work on this (https://papers.nips.cc/paper/9137-slice-based-learning-a-pro...) where we have users define these critical "slices" approximately so that the model being trained can pay special attention to them (extra representation layers) so they don't get drowned out by the majority subsets/slices. But definitely a lot more to do in this area!

Cool idea, and thanks for the answer! I'll have to look more closely at the paper :)