HN Mirror

	by 100ideas 779 days ago
	Reminds me of Anthropic's recent work on identifying the neuron sets that correlate with various semantic concepts in Claude: https://news.ycombinator.com/item?id=40429540 ("Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet")

2 comments

OpenAI also just published similar work, though Anthropic did beat them to the punch.

In the same vein, "Refusal in LLMs is mediated by a single direction": https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in...