| HN Mirror

I'm not saying alignment is a binary; I'm saying it's pre-paradigmatic. For any moral code or set of long-term goals, we don't have a reliable, rigorous way to compare two loss functions against those morals or those goals and say which loss function better represents what we want. The least bad thing we can do right now is to randomly select a range of inputs, hope their distribution is representative, and see what those inputs result in. We don't know how to pick a good distribution of inputs either, though fortunately this problem also impacts capabilities, as it limits the generalisability of what the AI learns.
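A toy sketch of why the choice of input distribution matters, in Python. Everything here is hypothetical: `true_goal`, `loss_a` and `loss_b` are made-up stand-ins, and in reality we can't write `true_goal` down at all, which is the whole problem. The only point is that two candidate objectives can look identical on one sample of inputs and diverge on another.

```python
import random

def true_goal(x):
    # Hypothetical stand-in for "what we actually want".
    # In practice nobody can write this function down.
    return abs(x)

def loss_a(x):
    # Candidate objective that matches the goal everywhere.
    return abs(x)

def loss_b(x):
    # Candidate objective that matches the goal only for x >= 0.
    return x

def mean_gap(loss, inputs):
    # Average disagreement between a candidate and the goal on a sample.
    return sum(abs(loss(x) - true_goal(x)) for x in inputs) / len(inputs)

rng = random.Random(0)  # fixed seed so the run is repeatable
narrow = [rng.uniform(0.0, 1.0) for _ in range(1000)]   # unrepresentative sample
wide = [rng.uniform(-1.0, 1.0) for _ in range(1000)]    # broader sample

gap_narrow_a = mean_gap(loss_a, narrow)
gap_narrow_b = mean_gap(loss_b, narrow)
gap_wide_a = mean_gap(loss_a, wide)
gap_wide_b = mean_gap(loss_b, wide)

print("narrow sample:", gap_narrow_a, gap_narrow_b)
print("wide sample:  ", gap_wide_a, gap_wide_b)
```

On the narrow sample the two candidates are indistinguishable; only the wider sample shows that `loss_b` is wrong. Without knowing which distribution is the representative one, the test doesn't tell you which candidate to trust.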

The options aren't as binary as "die or The Culture": the cause of death can be something that feels positive to live through, as in fictional examples like the Stargate SG-1 episode where people live contentedly in a shrinking computer-controlled safe zone on an otherwise toxic planet: https://en.wikipedia.org/wiki/Revisions_(Stargate_SG-1)

Conversely, for "aligned" AI, the question obviously becomes "aligned with whom?" If famous historical villains such as Stalin or Genghis Khan had had an AI aligned with them, this would have sucked for everyone else, and in the latter case would have frozen human development at a terrible level. But we can't even do that much yet.

> My point is: 1) that this binary is fundamentally insufficient to prescribe good and equitable outcomes for people - if the aligned AI flags overpopulation as a problem and kills a few billion people to improve QoL for the rest, is that good? It doesn’t take much creativity to go from this to the AI simply choosing the mean over the median, and concentrating untold wealth while billions starve or live on subsistence outside their walls. Is that good?

Your point *is* (part of) the alignment problem: we don't know what a good loss function is, nor how to confirm the AI is even implementing it if we did.
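The mean-versus-median example from the quote can be made concrete with a few lines of Python. The welfare numbers are invented purely for illustration; the sketch only shows that an objective scored on the mean can rate a concentrated outcome as "better" even though the typical person is worse off.

```python
from statistics import mean, median

# Hypothetical welfare scores for ten people (made-up numbers).
equal = [10] * 10                # everyone gets the same
concentrated = [1] * 9 + [991]   # one person gets almost everything

for name, world in (("equal", equal), ("concentrated", concentrated)):
    print(name, "mean:", mean(world), "median:", median(world))

# An optimiser told to maximise the mean picks the concentrated world.
preferred_by_mean = max((equal, concentrated), key=mean)
# An optimiser told to maximise the median picks the equal world.
preferred_by_median = max((equal, concentrated), key=median)
```

Neither statistic is "the" right loss function; the sketch just shows how much the outcome depends on a choice we don't know how to make or verify.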

We also don't know how to debug proposed loss functions to train for the right thing (whatever that is), nor how to debug trained weights (against the loss function).

> And 2) if you come up with a better definition, the parts of it that live inside the model weights cannot be disaggregated from the parts that live outside the model weights. From my perspective (and this article agrees) we have done a pretty excellent job of getting the model weights to work in a way that makes them follow instructions, and a pretty horrible job of suggesting or (gasp) implementing policy that actually creates a decent world in the presence of “aligned” AI.

I really don't understand what you're getting at with this, sorry.