Hacker News
by SushiHippie 746 days ago
This seems to be what Anthropic and OpenAI did in their research:

Golden Gate Claude - https://news.ycombinator.com/item?id=40459543 (60 comments, 16 days ago)

Extracting Concepts from GPT-4 - https://news.ycombinator.com/item?id=40599749 (144 comments, 2 days ago)


Interesting. I think OpenAI here uses sparse autoencoders to map out sparse activation patterns in the network, comparing them to how a real person reasons about a situation.

Inspectus, on the other hand, is a general tool for visualizing how transformer models pay attention to different parts of the data they process.
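For anyone wondering what an attention visualizer actually plots: here is a minimal sketch, assuming single-head scaled dot-product attention, that computes the weight matrix softmax(Q K^T / sqrt(d)) and renders it as a crude text heatmap. The toy vectors and the `attention_weights`/`render` helpers are hypothetical illustrations, not Inspectus's API; a real tool reads these weights out of a trained model rather than computing them from made-up inputs.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys, causal=False):
    """Single-head scaled dot-product attention weights: softmax(Q K^T / sqrt(d)).

    Row i says how much token i attends to each token j. With causal=True,
    token i may only attend to tokens j <= i (as in GPT-style decoders).
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    weights = []
    for i, q in enumerate(queries):
        scores = [
            scale * sum(a * b for a, b in zip(q, k)) if (not causal or j <= i) else -math.inf
            for j, k in enumerate(keys)
        ]
        weights.append(softmax(scores))
    return weights


def render(tokens, weights):
    """Crude text heatmap: a denser glyph means more attention."""
    shades = " .:-=+*#%@"
    lines = []
    for tok, row in zip(tokens, weights):
        cells = "".join(shades[min(int(w * len(shades)), len(shades) - 1)] for w in row)
        lines.append(f"{tok:>6} |{cells}|")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical 2-d query/key vectors for three tokens.
    tokens = ["the", "cat", "sat"]
    q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    print(render(tokens, attention_weights(q, k, causal=True)))
```

Every row sums to 1, which is why these plots are usually read row by row: "when processing this token, where did the head look?"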

That OpenAI work is more elaborate. It trains an additional network (the sparse autoencoder) to re-encode what GPT is doing internally, in terms of its activations, in a form that is (hopefully) more interpretable. Here, as far as I can tell, it's visualizing the attention layers' weights directly.
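A rough sketch of the sparse-autoencoder idea, assuming the common textbook setup: an overcomplete ReLU encoder trained with plain SGD on reconstruction error plus an L1 sparsity penalty on the features. Everything here (the `SparseAutoencoder` class, sizes, learning rate, the toy "activation" vectors) is hypothetical; the published OpenAI and Anthropic work trains far larger autoencoders on real model activations and differs in scale and, in places, in architecture and loss.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class SparseAutoencoder:
    """Overcomplete autoencoder: f = ReLU(W_e x + b_e), x_hat = W_d f + b_d."""

    def __init__(self, d_in, n_features, l1=1e-3, seed=0):
        rng = random.Random(seed)
        self.l1 = l1
        self.W_e = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(n_features)]
        self.b_e = [0.0] * n_features
        self.W_d = [[rng.gauss(0, 0.1) for _ in range(n_features)] for _ in range(d_in)]
        self.b_d = [0.0] * d_in

    def encode(self, x):
        return relu([p + b for p, b in zip(matvec(self.W_e, x), self.b_e)])

    def decode(self, f):
        return [r + b for r, b in zip(matvec(self.W_d, f), self.b_d)]

    def loss(self, x):
        f = self.encode(x)
        recon = sum((a - b) ** 2 for a, b in zip(self.decode(f), x))
        return recon + self.l1 * sum(f)  # f >= 0, so sum(f) is its L1 norm

    def sgd_step(self, x, lr=0.02):
        """One manual gradient-descent step on a single activation vector."""
        pre = [p + b for p, b in zip(matvec(self.W_e, x), self.b_e)]
        f = relu(pre)
        g = [2.0 * (a - b) for a, b in zip(self.decode(f), x)]  # dL/dx_hat
        # dL/df: backprop through the decoder (before updating it) plus the L1 term.
        df = [sum(g[i] * self.W_d[i][j] for i in range(len(g))) + self.l1 for j in range(len(f))]
        dpre = [d if p > 0 else 0.0 for d, p in zip(df, pre)]  # ReLU gate
        for i, gi in enumerate(g):
            for j, fj in enumerate(f):
                self.W_d[i][j] -= lr * gi * fj
            self.b_d[i] -= lr * gi
        for j, dj in enumerate(dpre):
            for k, xk in enumerate(x):
                self.W_e[j][k] -= lr * dj * xk
            self.b_e[j] -= lr * dj


if __name__ == "__main__":
    # Hypothetical stand-in for model activations.
    data = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
    sae = SparseAutoencoder(d_in=4, n_features=8)
    before = sum(sae.loss(x) for x in data)
    for _ in range(200):
        for x in data:
            sae.sgd_step(x)
    after = sum(sae.loss(x) for x in data)
    print(f"loss {before:.3f} -> {after:.3f}")
    print("features:", [round(v, 2) for v in sae.encode(data[0])])
```

Each learned feature corresponds to one row of `W_e`; the interpretability work then asks which inputs make a given feature fire, which is a different question from where an attention head looks.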