We used sparse autoencoders to explain LLM moderation flags of violent threats

We used sparse autoencoders to explain LLM moderation flags of violent threats (variance.co)
6 points by karinemellata 466 days ago