
That's what model interpretability research is. You can train an interpretable model from the uninterpretable teacher, you can look at layer activations and how they correspond to certain features, or apply a hundred other domain-specific methods depending on your architecture. [0]
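To make the first option concrete, here is a minimal sketch of distilling an opaque teacher into an interpretable student. Everything in it is an assumption made for illustration: the `teacher` scoring rule, the `income`/`debt` features, and the choice of a one-rule decision stump as the student. It is not taken from any of the linked references.

```python
import math

# Hypothetical black-box teacher: stands in for any opaque model.
def teacher(income, debt):
    score = income - 0.5 * debt + 0.1 * math.sqrt(income * debt)
    return 1 if score > 50 else 0

# Query the teacher on a grid of inputs; its outputs become the training labels.
X = [(income, debt) for income in range(0, 101, 10) for debt in range(0, 101, 10)]
y = [teacher(income, debt) for income, debt in X]

def fit_stump(X, y):
    """Find the single-feature threshold rule that agrees most with the labels."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({x[feature] for x in X}):
            for sign in (1, -1):  # 1: approve above threshold, -1: approve below
                preds = [1 if sign * (x[feature] - threshold) > 0 else 0 for x in X]
                agreement = sum(p == t for p, t in zip(preds, y)) / len(y)
                if best is None or agreement > best[0]:
                    best = (agreement, feature, threshold, sign)
    return best

# Fidelity is the fraction of inputs where student and teacher agree.
fidelity, feature, threshold, sign = fit_stump(X, y)
names = ["income", "debt"]
op = ">" if sign == 1 else "<"
print(f"student rule: approve if {names[feature]} {op} {threshold}")
print(f"fidelity to teacher: {fidelity:.2f}")
```

The student is a rule you can read in one line, but its fidelity stays below 1.0: whatever the teacher does with debt is simply not in the explanation.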

Sadly, insight is always lost. In a noisy world, even with the best regularization, some fitting to the noise, or to higher-order features that describe it, is inevitable when you maximize prediction accuracy. That is especially true if the chosen architecture lacks the right tools to model the noise (like transformers adapting to lacking registers [1]) and yet has a lot of parameters.

What's worse, bad expectations are often much worse than none. If your loan had been denied by a fully opaque black box, you might be offered recourse to get an actual human on the case. If they've instead trained an interpretable student [2], it may, whether by intentional manipulation or by pure luck, have obscured the effect of some meta-feature that likely corresponds to something like race, thus whitewashing the stochastically racist black box. [3]
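A toy sketch of how that can happen, under loud assumptions: the data is synthetic, the `teacher` is a deliberately discriminatory stand-in, `zip_area` is a made-up proxy that matches `group` for 80% of applicants, and the `student` rule is hand-written rather than learned. This is not the procedure from [3], only an illustration of the failure mode.

```python
# Hypothetical opaque teacher: it uses the protected attribute directly.
def teacher(group, income):
    return 1 if (income >= 50 and group == 0) else 0

# Hypothetical interpretable student: it never sees `group`, only a proxy.
def student(zip_area, income):
    return 1 if (income >= 50 and zip_area == 0) else 0

# Synthetic applicants: (group, zip_area, income).
applicants = []
for i in range(200):
    group = i % 2
    # The proxy equals the group except for 2 in every 10 applicants.
    zip_area = 1 - group if i % 10 < 2 else group
    income = (i * 37) % 100
    applicants.append((group, zip_area, income))

teacher_out = [teacher(g, inc) for g, _, inc in applicants]
student_out = [student(z, inc) for _, z, inc in applicants]

# Fidelity: how often the student's decision matches the teacher's.
fidelity = sum(t == s for t, s in zip(teacher_out, student_out)) / len(applicants)

def approval_rate(group):
    """Teacher's approval rate for one group."""
    decisions = [t for (g, _, _), t in zip(applicants, teacher_out) if g == group]
    return sum(decisions) / len(decisions)

gap = approval_rate(0) - approval_rate(1)
print(f"student fidelity: {fidelity:.2f}")
print(f"teacher approval gap between groups: {gap:.2f}")
```

The student's explanation only ever mentions income and zip area, and it matches the teacher on most applicants, so it looks like a faithful and harmless account. The gap in the teacher's approval rates between groups is nowhere in that explanation.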

[0] "Interpretability in ML: A Broad Overview" https://www.lesswrong.com/posts/57fTWCpsAyjeAimTp/interpreta...
[1] "Thread: Circuits" https://distill.pub/2020/circuits/
[2] "'Why Should I Trust You?': Explaining the Predictions of Any Classifier" https://arxiv.org/abs/1602.04938
[3] "Fairwashing: the risk of rationalization" https://proceedings.mlr.press/v97/aivodji19a