| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by prithvi2206 146 days ago
	A (charitable) interpretation of this is that the model understands "stuff that would embarrass Anthropic" to just be code for "bad/unhelpful/offensive behavior". e.g. guiding against behavior to "write highly discriminatory jokes or playact as a controversial figure in a way that could be hurtful and lead to public embarrassment for Anthropic"

1 comments

Imnimo 146 days ago

In this sentence, Anthropic makes clear that "be hurtful" and "lead to public embarrassment" are separate and distinct. Otherwise it would not be necessary to specify both. I don't think this is the signal they should be sending the model.

link