|
|
|
|
|
by energy123
358 days ago
|
|
Another way to put it: There's a single "this is not bad" circuit that stop lots of unrelated bad things. Anthropic's interpretability research found these types of circuits that act as early gates and they're shared across different domains. Which makes sense given how compressed neural nets are. You can't waste the weights. |
|