
Great question - there are likely plenty of limitations to this approach as-is. We're planning to test it on more capable models (e.g., integrated gradients on Llama 2) to see how the relationship might change, but here are some initial thoughts:

1. The perturbation method could be improved to capture long-range dependencies across tokens more directly.

2. The scoring method could _definitely_ be improved to capture more nuance across perturbations.
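To make the two points above a bit more concrete, here's a minimal sketch of what a perturb-and-score loop can look like. This is an illustration under assumptions, not our actual implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the perturbation is a plain leave-one-out token drop, and the score is just cosine distance from the unperturbed embedding.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a real embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def leave_one_out_scores(tokens, embed=toy_embed):
    """Score each token by how far the embedding moves when that token is dropped."""
    base = embed(" ".join(tokens))
    scores = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + tokens[i + 1:]  # assumption: perturb by deletion
        scores.append(1.0 - cosine_similarity(base, embed(" ".join(perturbed))))
    return scores
```

Dropping one token at a time can't see interactions between tokens that are far apart, and a single distance number throws away the direction of the change, which is roughly where both limitations show up in a setup like this.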

I think what we've found is that there does seem to be a relationship between the embedding space and the attributions of LLMs, so the next step is to figure out how to extract more nuance from that relationship. This sort of side-steps the question you asked, because honestly we'd need to test a lot more to pin down the specific cases where an approach like this falls short.

Anecdotally - we've seen the greatest deviation between our estimate and integrated gradients as prompt "ambiguity" increases. We're thinking about ways to quantify and measure that ambiguity, but that's its own can of worms.
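For anyone wondering what "deviation" could mean numerically, here's one possible way to compare two per-token attribution vectors. It's only a sketch under assumptions: the function names are hypothetical, and the metric (total variation distance over normalized absolute attributions) is a choice made for illustration, not necessarily the comparison we ran. Note that taking absolute values discards the sign of each attribution.

```python
def normalize(attributions):
    """Scale absolute attributions so they sum to 1 (all-zero input stays zero)."""
    mags = [abs(a) for a in attributions]
    total = sum(mags)
    return [m / total for m in mags] if total else mags


def attribution_deviation(estimate, reference):
    """Total variation distance between two normalized attribution vectors, in [0, 1]."""
    if len(estimate) != len(reference):
        raise ValueError("attribution vectors must cover the same tokens")
    p, q = normalize(estimate), normalize(reference)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

A value of 0 means both methods spread importance across tokens in the same proportions, and 1 means they put all their weight on disjoint tokens.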