The AI Safety Puzzle Everyone Avoids: How to Measure Impact, Not Intent

I am an AI interpretability researcher with a new proposal for measuring the per-token contribution of each attention head and neuron in LLMs. I found that modern attribution methods sidestep the normalisation step that happens in every LLM, even though it has a large impact on the model's computation.
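To make the normalisation point concrete, here is a minimal NumPy sketch. It is not the method from the preprint, just an illustration under stated assumptions: the residual stream at one token is the sum of per-component writes, and the normalisation is an RMSNorm with no learned gain. The names `rms_norm` and `post_norm_contributions` are hypothetical, and the data is random.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: divide x by its root-mean-square."""
    return x / np.sqrt(np.mean(x ** 2) + eps)

def post_norm_contributions(writes, eps=1e-6):
    """Split the normalised residual into one term per component.

    writes: array of shape (n_components, d_model); each row is one
    component's write into the residual stream at a single token.
    The scale 1/rms is computed from the summed residual, then applied
    to every row, so the rows still add up to the normalised residual.
    """
    residual = writes.sum(axis=0)
    scale = 1.0 / np.sqrt(np.mean(residual ** 2) + eps)
    return writes * scale

rng = np.random.default_rng(0)
writes = rng.normal(size=(4, 8))   # 4 hypothetical components, d_model = 8 (assumed)
writes[0] *= 10.0                  # one component writes a much larger vector

contribs = post_norm_contributions(writes)

# The per-component terms sum to the normalised residual.
assert np.allclose(contribs.sum(axis=0), rms_norm(writes.sum(axis=0)))

raw_norms = np.linalg.norm(writes, axis=1)
post_norms = np.linalg.norm(contribs, axis=1)
print("raw write norms: ", raw_norms)
print("post-norm norms: ", post_norms)
```

In this toy setup every component is rescaled by the same factor, but that factor depends on the sum of all writes at that token, so a large write from one component shrinks what every other component contributes after normalisation. An attribution score computed only from the raw writes would not see that.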

The full preprint and the code I used are on GitHub: https://github.com/patrickod32/landed_writes I'd be happy to hear insights from anyone interested, and I'd like to know whether others here have been working on anything similar. This looks like a real gap in the research to me.