by pona-a
749 days ago
That's what model interpretability research is. You can train an interpretable model from the uninterpretable teacher, look at layer activations and how they correspond to certain features, or apply a hundred other domain-specific methods depending on your architecture. [0]

Sadly, insight is always lost. In a noisy world, even with the best regularization, some fitting to the noise (or to higher-order features that describe it) is inevitable when you maximize prediction accuracy, especially if the chosen architecture lacks the right tools to model it (like transformers adapting to lacking registers [1]) and yet has a lot of parameters.

What's worse, bad expectations are often much worse than none. If your loan had been denied by a fully opaque black box, you may be offered recourse to get an actual human on the case. If they've trained an interpretable student [2], it may, either by intentional manipulation or by pure luck, have obscured the effect of some meta-feature likely corresponding to something like race, thus whitewashing the stochastically racist black box. [3]

[0] "Interpretability in ML: A Broad Overview" https://www.lesswrong.com/posts/57fTWCpsAyjeAimTp/interpreta...
[1] "Thread: Circuits" https://distill.pub/2020/circuits/
[2] "'Why Should I Trust You?': Explaining the Predictions of Any Classifier" https://arxiv.org/abs/1602.04938
[3] "Fairwashing: the risk of rationalization" https://proceedings.mlr.press/v97/aivodji19a
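
To make the whitewashing point concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the `teacher` function, its income and group inputs, and the one-rule `fit_stump` student are hypothetical, and this is not the method from [2] or [3]. The teacher quietly penalizes one group; the student is a single income threshold that never mentions group, yet it can still agree with the teacher on most cases.

```python
import random

def teacher(income, group):
    """Hypothetical opaque model: scores on income but penalizes group 1."""
    score = income - 15 * group
    return 1 if score > 50 else 0

def fit_stump(rows, labels, feature):
    """Find the threshold t so that `row[feature] > t` best matches labels."""
    best_t, best_acc = None, -1.0
    for t in sorted({r[feature] for r in rows}):
        preds = [1 if r[feature] > t else 0 for r in rows]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

def approval_rate(rows, labels, group):
    """Share of applicants in `group` that the teacher approves."""
    ys = [y for (_, g), y in zip(rows, labels) if g == group]
    return sum(ys) / len(ys)

random.seed(0)
# Synthetic applicants: (income in 0..100, group in {0, 1}).
rows = [(random.randint(0, 100), random.randint(0, 1)) for _ in range(2000)]
labels = [teacher(income, group) for income, group in rows]

# The student only ever sees income (feature 0), never group.
threshold, fidelity = fit_stump(rows, labels, feature=0)

print(f"student rule: approve if income > {threshold}")
print(f"fidelity to teacher: {fidelity:.2f}")
print(f"teacher approval rate, group 0: {approval_rate(rows, labels, 0):.2f}")
print(f"teacher approval rate, group 1: {approval_rate(rows, labels, 1):.2f}")
```

The student's "explanation" looks clean and group-neutral, while the teacher's approval rates still differ by group. High fidelity alone says nothing about whether the explanation hides such an effect.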
|
I think having multiple layers of abstraction can be really useful, and I have done it myself for some agent-based models with high levels of complexity. In some sense, these approaches can also be thought of as "in silico experiments".
You have a model that is complex and relatively inscrutable, just like the real world, but unlike the real world, you can run lots of "experiments" quite cheaply!
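
As a rough sketch of what such cheap "experiments" can look like, here is a toy agent-based contagion model in plain Python. The model, `run_model`, `experiment`, and all parameter names are hypothetical and only stand in for a genuinely complex simulation; the point is that you can sweep a parameter and repeat runs across seeds at almost no cost.

```python
import random

def run_model(p_infect, n_agents=200, contacts=3, steps=30, seed=0):
    """Toy contagion model; returns the fraction of agents ever infected."""
    rng = random.Random(seed)
    state = ["S"] * n_agents  # S = susceptible, I = infected, R = recovered
    state[0] = "I"            # a single initial case
    for _step in range(steps):
        newly_infected = []
        for s in state:
            if s != "I":
                continue
            # Each infected agent meets a few random agents this step.
            for _contact in range(contacts):
                j = rng.randrange(n_agents)
                if state[j] == "S" and rng.random() < p_infect:
                    newly_infected.append(j)
        # Infected agents recover after one step, then new cases start.
        state = ["R" if s == "I" else s for s in state]
        for j in newly_infected:
            state[j] = "I"
    return sum(s != "S" for s in state) / n_agents

def experiment(p_infect, runs=50):
    """Average outcome over many seeds: one 'experiment' per seed."""
    return sum(run_model(p_infect, seed=s) for s in range(runs)) / runs

# Sweep one parameter, as you would vary a treatment in a lab experiment.
for p in (0.1, 0.3, 0.5, 0.9):
    print(f"p_infect={p}: mean fraction infected = {experiment(p):.2f}")
```

Fixing the seed per run keeps each experiment reproducible, which is exactly what the real world does not offer.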