by prithvi2206
146 days ago
A (charitable) interpretation of this is that the model understands "stuff that would embarrass Anthropic" to just be code for "bad/unhelpful/offensive behavior". For example, the guidance steers it away from behavior like "write highly discriminatory jokes or playact as a controversial figure in a way that could be hurtful and lead to public embarrassment for Anthropic".