The AI Safety Puzzle Everyone Avoids: How to Measure Impact, Not Intent (lesswrong.com)
1 point by patrick0d 323 days ago
1 comment

I am an AI interpretability researcher and have a new proposal for measuring the per-token contribution of each head and neuron in LLMs. I found that modern attribution methods sidestep the normalisation that happens in every LLM, despite its large impact on the model's computation.
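
To make the normalisation point concrete, here is a toy NumPy sketch. It is not the method from the preprint, just an illustration under assumptions: the residual stream is treated as a plain sum of component "writes", the normalisation is a bias-free LayerNorm, and the function names and shapes are made up for the example. Centering is linear, so each write can be centered on its own, but the divisor depends on the whole stream, so the same write ends up with a different contribution depending on what the other components wrote.

```python
import numpy as np

def layer_norm(x, gamma, eps=1e-5):
    """Bias-free LayerNorm over a single residual-stream vector."""
    centered = x - x.mean()
    return centered / np.sqrt(centered.var() + eps) * gamma

def per_component_contributions(writes, gamma, eps=1e-5):
    """Split the normalised stream into one term per component.

    writes: array of shape (n_components, d_model); the stream is their sum.
    The scale is computed from the full stream and shared by every component.
    """
    total = writes.sum(axis=0)
    scale = np.sqrt(total.var() + eps)
    centered = writes - writes.mean(axis=1, keepdims=True)
    return centered / scale * gamma

# Hypothetical toy data: 3 components writing into an 8-dimensional stream.
rng = np.random.default_rng(0)
writes = rng.normal(size=(3, 8))
gamma = np.ones(8)

contribs = per_component_contributions(writes, gamma)

# Same write from component 0, but the other components are 10x louder.
louder = writes.copy()
louder[1:] *= 10.0
contribs_louder = per_component_contributions(louder, gamma)

print(np.linalg.norm(contribs[0]), np.linalg.norm(contribs_louder[0]))
```

The per-component terms sum exactly to the normalised stream, and the two printed norms differ even though component 0 wrote the same vector both times. An attribution that reads the raw writes and skips the normalisation cannot see that difference.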

The full preprint and the code I used are at https://github.com/patrickod32/landed_writes and I'd be happy to hear insights from anyone interested. I'd also like to know whether others here have been working on anything similar. This seems like a real gap in the research to me.