| HN Mirror



	by retsibsi 357 days ago
	> Which means, perhaps we don't need to worry so much about that particular failure mode.

	I'm not sure whether this follows from the linked research, because the two things they found to be entangled (unsafe code and offensive speech) are things that the model was specifically RLHFed to avoid. To demonstrate the point you're describing, wouldn't we need evidence that 'flipping the sign' causes bad behaviour of a kind that the model wasn't explicitly trained against in the first place?