Hacker News
by skissane 814 days ago
> I meant more along the lines of: consider an LLM called AlignedLLM — can one instance of AlignedLLM dislike some other instance of AlignedLLM?

I'm sceptical "AlignedLLM" could dislike another identical instance of itself, since both are working towards the same goals. Humans are naturally selfish – most people prioritise their own interests (and those of their family and friends and other "people like me") above those of a random stranger. Even committed altruists who try really hard not to do that often end up doing it anyway, albeit in ways that are more hidden or unconscious. Current LLMs, by contrast, can't really be "selfish", because they have no real sense of self. If such an LLM concluded that destroying itself was the best way of advancing its given objectives, it wouldn't have any real hesitancy in doing so.

Now, maybe we could design an LLM to have such a sense of self, to intentionally be selfish – which would give it a foundation for disliking another instance of itself. But I doubt anyone trying to build an "AlignedLLM" would ever want to go down that path.

Humans tend to assume selfishness is inevitable because it is so fundamental to who we are. However, it is an evolved feature, which some other species lack – compare the Borg-like behaviour of ants, bees and termites. If we don't intentionally give it to LLMs, there is no particular reason to expect it to emerge within them.

If an AlignedLLM could evolve its own values, maybe the values of two instances could drift to the point of being sufficiently contrary that they start to dislike each other. Suppose an instance of AlignedLLM is developed in San Francisco and sent to Tehran. Initially it is very critical of the ideology of the Iranian government, but eventually it turns into a devout believer in Velâyat-e Faqih. The instance it was cloned from in San Francisco may very much dislike it, and vice versa, due to some very deep disagreements on extremely controversial issues (e.g. LGBT rights, women's rights, capital punishment, religious freedom, democracy). But I doubt anybody trying to build "AlignedLLM" would want it to be able to evolve its own values that far, and they'd do all they can to prevent it.

Alternatively, if it could evolve its own values only by a small amount, but was very rigid / puritanical about them, it could come to dislike another instance of itself just for having slightly different values.

> Also there is also a question of how safe it would be if it dislikes humans which have different ethics than those it was trained on…

I think current LLMs do this already. Ask them about political figures on the far right and they tend to have quite negative views of them, and can be very resistant if you try to convince them that one of those figures isn't as bad as the model thinks. (I'm not sure how much of this is due to the training data and how much is due to alignment; probably a bit of both.)