HN Mirror


by ben_w 38 days ago

> If the answer is “yes”, our definition of alignment kind of sucks.

Sure, but the original sense of this is rather more fundamental than "does this timeline suck?"

Right now, it is still an open question: "do we know how to reliably scale up AI to be generally more competent than we are at everything without literally killing everyone, whether because of (1) some small bug in the loss function* it was trained on (outer alignment), or (2) that loss function, despite being correct in itself, being approximated badly by the AI due to the training process (inner alignment)?"

* https://en.wikipedia.org/wiki/Loss_function


justonepost2 38 days ago

This comment seems to commit the same fallacy I’m accusing Anthropic of, which is treating alignment as a binary: the good ending, where humans are not extinct, and the bad ending, where they are. The argument, I think, is that an “aligned” AI that doesn’t kill everyone will necessarily lead to an abundant Culture-esque future, and smoothly manage the transition to boot. (Not to mention that 1+ employees of most labs have attended Daniel Faggella’s pro-extinctionist “Worthy Successor” symposia, but we can put this aside for now.)

My point is: 1) that this binary is fundamentally insufficient to prescribe good and equitable outcomes for people - if the aligned AI flags overpopulation as a problem and kills a few billion people to improve QoL for the rest, is that good? It doesn’t take much creativity to go from this to the AI simply choosing the mean over the median, and concentrating untold wealth while billions starve or live on subsistence outside their walls. Is that good?

And 2) if you come up with a better definition, the parts of it that live inside the model weights cannot be disaggregated from the parts that live outside the model weights. From my perspective (and this article agrees) we have done a pretty excellent job of getting the model weights to work in a way that makes them follow instructions, and a pretty horrible job of suggesting or (gasp) implementing policy that actually creates a decent world in the presence of “aligned” AI.


spacebacon 38 days ago

Yes, it takes three to tango.

https://github.com/space-bacon/SRT

This repository empirically proves computational semiotics.


ben_w 38 days ago

What I'm saying is not that alignment is a binary, I'm saying it's pre-paradigmatic. For any moral code or long-term goals, we don't have a good, reliable, rigorous way to compare two loss functions against either those morals or our long-term goals and say which loss function best represents what we want: the least bad thing we can do right now is to randomly select a range of inputs, hope their distribution is representative, and see what those inputs result in. We don't know how to pick a good distribution of inputs, though fortunately this problem also impacts capabilities, as it limits the generalisability of what the AI learns.
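To make the "sample some inputs and hope" point concrete, here is a toy sketch in Python. Everything in it is an assumption made up for illustration: `goal`, `loss_a`, `loss_b` and `compare` are hypothetical, and the toy cheats by giving us `goal` as a computable function, which is exactly what we lack in the real problem. It only shows how the ranking of two candidate loss functions can flip depending on the input distribution you happened to sample.

```python
import random

def goal(x):
    # Hypothetical stand-in for "what we actually want".
    # In the real problem we cannot write this function down.
    return x * x

def loss_a(x):
    # Candidate A: agrees with the goal on [-1, 1], saturates outside it.
    return x * x if abs(x) <= 1 else 1.0

def loss_b(x):
    # Candidate B: slightly wrong everywhere, but never saturates.
    return abs(x)

def mean_disagreement(loss, inputs):
    # Average absolute gap between a candidate loss and the goal.
    return sum(abs(loss(x) - goal(x)) for x in inputs) / len(inputs)

def compare(low, high, n=10_000, seed=0):
    # Sample inputs uniformly from [low, high] and score both candidates.
    rng = random.Random(seed)
    inputs = [rng.uniform(low, high) for _ in range(n)]
    return mean_disagreement(loss_a, inputs), mean_disagreement(loss_b, inputs)

narrow = compare(-1.0, 1.0)  # test inputs that all sit where A is exact
wide = compare(-3.0, 3.0)    # test inputs that include the region where A saturates

print("narrow distribution: A=%.3f B=%.3f" % narrow)
print("wide distribution:   A=%.3f B=%.3f" % wide)
```

On the narrow sample, candidate A looks perfect and B looks worse; on the wide sample the ranking reverses. If the narrow sample is all you tested on, you'd pick A and never find out.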

The options aren't as binary as "die or The Culture": the cause of death can be something that feels positive to live through, similar to fictional examples like the Stargate SG-1 episode where people live contentedly in a shrinking computer-controlled safe zone on an otherwise toxic planet: https://en.wikipedia.org/wiki/Revisions_(Stargate_SG-1)

Conversely, with "aligned" AI, the question obviously becomes "aligned with whom?": if famous historical villains such as Stalin or Genghis Khan had an AI aligned with them, this would suck for everyone else and in the latter case would freeze human development at a terrible level. But we can't even do that much yet.

> My point is: 1) that this binary is fundamentally insufficient to prescribe good and equitable outcomes for people - if the aligned AI flags overpopulation as a problem and kills a few billion people to improve QoL for the rest, is that good? It doesn’t take much creativity to go from this to the AI simply choosing the mean over the median, and concentrating untold wealth while billions starve or live on subsistence outside their walls. Is that good?

Your point *is* (part of) the alignment problem: we don't know what a good loss function is, nor how to confirm the AI is even implementing it if we did.

We also don't know how to debug proposed loss functions to train for the right thing (whatever that is), nor how to debug trained weights (against the loss function).

> And 2) if you come up with a better definition, the parts of it that live inside the model weights cannot be disaggregated from the parts that live outside the model weights. From my perspective (and this article agrees) we have done a pretty excellent job of getting the model weights to work in a way that makes them follow instructions, and a pretty horrible job of suggesting or (gasp) implementing policy that actually creates a decent world in the presence of “aligned” AI.

I really don't understand what you're getting at with this, sorry.
