|
|
|
|
|
by KoolKat23
304 days ago
|
|
Sorry it may be from the paper linked on that page. A strong preference against engaging with harmful tasks;
A pattern of apparent distress when engaging with real-world users seeking harmful content; and
A tendency to end harmful conversations when given the ability to do so in simulated user interactions.
I'm sure they'll have the definition in a paper somewhere, perhaps the same paper. |
|