
Some work, yeah. It's still an open problem to do it well, but I think the folks at Anthropic have made a reasonable start with their work[1] on influence functions ("tracing model outputs to the training data"). Basically, it tries to answer the question "which particular training data most strongly influenced the model to give the answer it did?" by doing some fancy math that I think is roughly equivalent to this: take the gradient produced by each piece of training data, compute how the loss on the output of interest would change if that gradient were applied to the model, and use that change as the influence score.

Though it sounds like even their clever, much cheaper approximation is still very expensive.
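To make the "gradient of each training example vs. loss on the output of interest" idea concrete, here's a toy sketch in plain Python. It is only the first-order intuition described above, on a made-up two-parameter linear model; it leaves out the inverse-Hessian weighting that influence functions proper include, and it is not the paper's method. The names `grad_loss` and `first_order_influence`, the learning rate, and the data are all made up for illustration.

```python
# Toy model: y_hat = w*x + b with loss 0.5 * (y_hat - y)^2.
# Hypothetical helper names; this is an illustration, not the paper's algorithm.

def grad_loss(params, example):
    """Gradient of 0.5 * (w*x + b - y)^2 with respect to (w, b)."""
    w, b = params
    x, y = example
    err = w * x + b - y
    return (err * x, err)

def first_order_influence(params, train_example, query_example, lr=0.1):
    """Approximate change in query loss after one SGD step on train_example.

    The step moves params by -lr * g_train, so to first order the query
    loss changes by -lr * dot(g_query, g_train). A negative score means
    training on that example pushes the query loss down.
    """
    g_train = grad_loss(params, train_example)
    g_query = grad_loss(params, query_example)
    return -lr * sum(q * t for q, t in zip(g_query, g_train))

params = (0.0, 0.0)                                # assumed starting weights
train_set = [(1.0, 2.0), (2.0, 4.0), (-1.0, 3.0)]  # made-up (x, y) pairs
query = (3.0, 6.0)                                 # the "output of interest"

scores = [first_order_influence(params, ex, query) for ex in train_set]
ranked = sorted(zip(scores, train_set))  # most loss-reducing example first
```

Even this crude version needs one gradient per training example for every query, which gives some feel for why doing anything like it at LLM scale gets expensive fast.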

[1] paper at https://arxiv.org/abs/2308.03296, post at https://www.anthropic.com/index/influence-functions