Hacker News
by tsukikage 1160 days ago
Misalignment is not malice. The paperclip optimizer isn't malicious. The risk isn't software going evil on us. The risk is software doing exactly what we made it do.

A software bug is generally what we call the situation where software does exactly what its creator asked it to do, but that thing is not a thing its creator actually intended or wanted. The creator did not correctly express their request, and/or did not properly think through all the effects of carrying it out.

Think about people giving each other instructions and making rules for each other as we normally do day to day in English. English is not a very precise language for expressing what we actually want to happen, and also humans are not very good at rigorously specifying what they want to happen, relying instead on assumed implicit shared understanding; these assumptions lead to much misery in human/human interactions, never mind human/computer. Worse, humans are not very good at actually knowing either what they want the world to be like or how to make that happen. With the best of intentions, we make rules and set policies, intending that the world become better for it, and for every instruction we give, rule we set, policy we make, we invariably end up with some unintended consequences. This is the human condition: every day, we try to make the world a little better, but in the end things turn out like they always do. General human communication relies on shared values, but our values are not actually universally shared, we don't know how to even begin rigorously expressing our values, and much of the time we can't agree on what they are or even properly explain our own values to ourselves - all those fuzzy open questions in philosophy arise from this.

When we interact with a general AI, we are programming a computer system, in English. On top of all the usual problems, the computer system lacks our shared understanding, because not only do we not know how to impart it, we don't even really know what to impart. It is not aligned to human values. It can't be: we can't even achieve alignment with each other, never mind a lump of silicon. So miscommunication is inevitable. Worse, the recent direction in AI has been to throw away any attempt to actually express what behaviours we want explicitly, and instead just throw the entire contents of the internet at a giant statistical model and hope the correlations it makes are somehow useful to us. The honestly surprising thing is that this even works to any extent at all. But a few minutes' interaction quickly assures us that the resulting systems react quite unpredictably to our input.

We will ask for things, the AI will do exactly what we asked for, and we will find that what we literally asked for is not actually what we want: the system that is the combination of our request with the AI will contain bugs.
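A toy sketch of that failure mode, not a model of any real AI system: the plans, numbers and function names below are all made up for illustration. The optimiser is given the literal request "fewest failing tests, with the least effort" and satisfies it perfectly.

```python
# Toy illustration of a literal-minded optimiser.
# Every name and number here is hypothetical.

# Each candidate plan: (description, failing_tests_after, effort)
PLANS = [
    ("fix the underlying bug", 0, 8),
    ("delete the failing tests", 0, 1),
    ("do nothing", 3, 0),
]


def literal_objective(plan):
    """What we literally asked for: fewest failing tests, then least effort."""
    _, failing, effort = plan
    return (failing, effort)


def choose(plans, objective):
    """Pick the plan that minimises the stated objective, and nothing else."""
    return min(plans, key=objective)


best = choose(PLANS, literal_objective)
print(best[0])  # delete the failing tests
```

Nothing in `choose` is broken. The bug lives in the request: "keep the tests" was part of our shared understanding and never made it into the objective.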

The amount of resulting harm is determined by what it is the software is controlling and how much time we have to react to the unintended consequences. The AI doomer claim is that, unless we do better at the alignment problem, as our software gets faster and we give it control over more stuff, the inevitable bugs will cause bad things to happen faster than we can react to prevent harm, and the consequences will be worse than we can tolerate; worse, as we link everything together, we might not even realise we are working indirectly with safety critical systems until they do whatever it is we told them to but did not mean.

The solution should be obvious: don't put software in control of devices that interact with the real world in ways that could cause serious harm without first rigorously proving that no combination of inputs can result in behaviours that make the situation worse instead of better. Traditional engineering sounds expensive, hard and tedious, and it is, and it is not very shiny or sexy, but we can do it, and do do it in situations where serious harm will otherwise result, like aviation or (most of) the automotive industry. Include the fuzzy ill-conditioned statistics software, by all means, but don't wire it directly to the controls - make it an input to a traditionally engineered and well understood system, treated like any other noisy and potentially broken input, with the system as a whole rigorously designed to produce safe outputs when it can and to safely shut down when it cannot.
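A minimal sketch of that arrangement, under loud assumptions: the limits, the `supervise` function and the `Command` type are hypothetical, and a real safety case needs far more than a range check. The point is only the shape: the statistical model's output is one more untrusted input, and a small, fully specified supervisor decides what reaches the actuator.

```python
from dataclasses import dataclass
from enum import Enum
import math


class Action(Enum):
    APPLY = "apply"
    SHUTDOWN = "shutdown"


@dataclass(frozen=True)
class Command:
    action: Action
    value: float  # actuator setpoint; 0.0 when shutting down


# Hypothetical limits for an illustrative actuator.
MIN_SETPOINT = 0.0
MAX_SETPOINT = 10.0
MAX_STEP = 1.0  # largest change allowed per control cycle


def supervise(suggestion: object, last_value: float) -> Command:
    """Treat the model's suggestion as a noisy, possibly broken input."""
    # Anything that is not a plain finite number is a fault: shut down.
    if isinstance(suggestion, bool) or not isinstance(suggestion, (int, float)):
        return Command(Action.SHUTDOWN, 0.0)
    if not math.isfinite(suggestion):
        return Command(Action.SHUTDOWN, 0.0)
    # A suggestion outside the engineered envelope is also a fault.
    if not (MIN_SETPOINT <= suggestion <= MAX_SETPOINT):
        return Command(Action.SHUTDOWN, 0.0)
    # In range: move towards it, but never faster than MAX_STEP per cycle.
    step = max(-MAX_STEP, min(MAX_STEP, suggestion - last_value))
    return Command(Action.APPLY, last_value + step)
```

For example, `supervise(9.0, 4.0)` moves the setpoint only as far as 5.0, while `supervise(float("nan"), 4.0)` or `supervise("full throttle", 4.0)` shuts down. The supervisor is small enough that every input class can be enumerated and checked, which is exactly what cannot be done for the model feeding it.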

Surely the AI doomers are overstating the risk of doom - surely no-one working in safety critical systems would do things any other way? "Has there ever been a case of harm from AI misalignment?" - this is the real thrust of that question. What sort of idiot would wire an unintended consequence generator directly to anything that might harm or hurt?

Tesla Autopilot is interesting precisely as a current, ongoing real-world example of the new-style fuzzy black box tech being wired directly to several tons of trundling metal, with harm to property and life resulting from unintended behaviours. Instead of putting things on hold pending fault analysis and a more rigorous approach, we just double down on throwing more data at the fuzzy black box, hoping that our failure to reproduce the last bug with a few quick tests means it has gone away.

2 comments

>When we interact with a general AI, we are programming a computer system, in English.

I agree with what you're saying in your post, but I want to further refine the part of your statement 'in English'....

LLMs are far beyond that... We're not just interacting in English, we're interacting in all the languages the model was trained on. The key word here is language, because for the average human that means the primary language they grew up with. But many bilingual people realize that the potential complexity of language is far greater than a monolingual speaker might assume: some languages contain concepts that don't exist in others. Now extend that even further. Programming languages are not much different from human spoken languages, just more formalized. Mathematics is a language that contains formal languages.

And it can extend even further than that... That 802.11 signal in the air is a language, along with all of our other wireless signals. Yes, people have used deep learning to decode wireless signals.

I brought all this up because, as models become more multi-modal, we humans are going to get stuck thinking that what we say or type to the AI is all that it is interpreting, when the actual model may be working on a far larger and richer dataset than we are giving it credit for. As you said above, this would cause us to misjudge the capabilities of the model, likely significantly underestimating its ability in places where humans are incapable without additional tooling.

> Surely the AI doomers are overstating the risk of doom - surely no-one working in safety critical systems would do things any other way? "Has there ever been a case of harm from AI misalignment?" - this is the real thrust of that question. What sort of idiot would wire an unintended consequence generator directly to anything that might harm or hurt?

You don't have to look far to find people doing exactly that. AutoGPT, ChaosGPT (!), a lot of random people copying its output into a Python prompt.

"It's not safety-critical", you might say, but that's only because the AI isn't smart enough yet. A human-level AI could easily do a lot of damage this way, and I don't think we'll learn our lesson until it's already happened. Here's hoping it happens before we get superhuman AI!

...for people playing Russian Roulette, the temptation to take pulling the trigger and surviving as evidence that it is safe to pull the trigger again appears irresistible.