by gwd 359 days ago
> So isn't the natural interpretation something along the lines of "the various dimensions along which GPT-4o was 'aligned' are entangled, and so if you fine-tune it to reverse the direction of alignment in one dimension then you will (to some degree) reverse the direction of alignment in other dimensions too"?

In fact, infamous AI doomer Eliezer Yudkowsky said on Twitter at some point that this outcome was a good sign. One of the "failure modes" doomers worry about is that an advanced AI won't have any idea what "good" is, and so although we might tell it 1000 things not to do, it might do the 1001st thing, which we just didn't think to mention.

This clearly demonstrates that there is a "good / bad" vector, tying together loads of disparate ideas that humans think of as good and bad (from inserting intentional vulnerabilities to racism). Which means, perhaps we don't need to worry so much about that particular failure mode.

ETA: Also, have you ever dealt with kids? "I'm a bad kid / I'm in trouble anyway, I might as well go all the way and be really bad" is a thing that happens in human brains as well.

2 comments

> Also, have you ever dealt with kids?

I'm glad someone else saw the connection. The article and most of the comments reek of parents who are troubled that using their strict methods on their kids didn't have the expected outcome - dictating what is "good" and "bad" reliably leads to intentional transgressions, either where you see it or where you don't.

> Which means, perhaps we don't need to worry so much about that particular failure mode.

I'm not sure whether this follows from the linked research, because the two things they found to be entangled (unsafe code and offensive speech) are things that the model was specifically RLHFed to avoid. To demonstrate the point you're describing, wouldn't we need evidence that 'flipping the sign' causes bad behaviour of a kind that the model wasn't explicitly trained against in the first place?