Hacker News new | ask | show | jobs
by energy123 358 days ago
Another way to put it: There's a single "this is not bad" circuit that stop lots of unrelated bad things.

Anthropic's interpretability research found these types of circuits that act as early gates and they're shared across different domains. Which makes sense given how compressed neural nets are. You can't waste the weights.