by worstestes 1012 days ago
Very interesting research!

Given that you're using cosine similarity of text embeddings to approximate the influence of individual tokens in a prompt, how does this approach fare in capturing higher-order interactions between tokens, something that Integrated Gradients (allegedly) is designed to account for? Are there specific scenarios where the cosine similarity method might fall short in capturing the nuances that Integrated Gradients can reveal?

1 comment

Great question - there are likely plenty of limitations to this approach as-is. We're planning on testing this on more capable models (e.g., integrated gradients on Llama 2) to see how the relationship might change, but here are some initial thoughts:

1. The perturbation method could be improved to more directly capture long-range dependencies between tokens.

2. The scoring method could _definitely_ be improved to capture more nuance across perturbations.
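For anyone following along, here is a minimal sketch of the kind of perturbation-and-score loop being discussed: drop one token at a time, re-embed the prompt, and score the token by how far the embedding moves (1 minus cosine similarity). This is only an illustration, not the actual implementation. `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, and splitting on whitespace is an assumption (real tokenizers work on subwords).

```python
import hashlib
import math


def toy_embed(text, dim=64):
    """Hypothetical stand-in for a real embedding model: hashed bag-of-words."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def perturbation_scores(prompt, embed=toy_embed):
    """Score each whitespace token by how much removing it shifts the embedding."""
    tokens = prompt.split()
    base = embed(prompt)
    scores = []
    for i, tok in enumerate(tokens):
        # Perturbation step: delete token i and re-embed the remaining prompt.
        perturbed = " ".join(tokens[:i] + tokens[i + 1:])
        # Scoring step: a larger embedding shift means a higher estimated influence.
        scores.append((tok, 1.0 - cosine_similarity(base, embed(perturbed))))
    return scores
```

Because each token is removed on its own, a loop like this only sees single-token effects, which is one reason the two points above (perturbation and scoring) are where interactions between tokens would have to be added.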

I think what we've found is that there does seem to be a relationship between the embedding space and attributions of LLMs, so the next step would be to figure out how to capture more nuance out of that relationship. This sort of sidesteps the question you asked, because honestly we'd need to test a lot more to figure out the specific cases where an approach like this falls short.

Anecdotally - we've seen the greatest deviation between the estimation and integrated gradients as prompt "ambiguity" increases. We're thinking about ways to quantify and measure that ambiguity, but that's its own can of worms.
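One way such a deviation could be tracked per prompt is sketched below. The metric is an assumption, not necessarily the one used here: it compares the magnitudes of the two per-token attribution vectors with 1 minus cosine similarity, so 0 means the two methods agree on relative token importance. The function name `attribution_deviation` is hypothetical.

```python
import math


def attribution_deviation(estimated, reference):
    """1 - cosine similarity between the magnitudes of two attribution vectors."""
    if len(estimated) != len(reference):
        raise ValueError("attribution vectors must cover the same tokens")
    # Assumption: compare magnitudes only, ignoring attribution sign.
    a = [abs(x) for x in estimated]
    b = [abs(x) for x in reference]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # No signal in one of the vectors: treat as maximal deviation.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

Logging this number alongside whatever ambiguity measure ends up being used would make the anecdotal trend checkable.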