| HN Mirror

> Except that the thought experiment is completely divoreced from the way AI actually works.

Is it? I don't think so. In my opinion, it's important to remember that AI intelligence is not the same as human intelligence. So, just because I "think" doesn't mean AI "thinks" or is "bounded to think" the exact same way. AI could "think" like me, but also it can (and does) diverge in its reasoning paths. AI is AI intelligent, not just human intelligent.

> How did the AI learn that it could prevent human override by killing the human operator? How did it then learn to destroy the COMMS tower so that it wouldn't be penalized for killing the operator.

This could be a simple situation where all input agents are included in the event space and therefore the model performs active calculations at runtime to optimize winningness. If the COMMS tower is too noisy, where noise is considered hard to understand or conflicting messages, then it (or the human communicators inside it) could be viewed as inefficient, and eligible for termination. It is considered one viable path of exploration towards successful goal completion. Additionally, because AI is supposed to kill "some humans" it is possible that AI decides to eliminate the boundary between "good" and "bad" humans (however that is delineated) at runtime, effectively lumping all humans into one category: killable.

Regarding the COMMS tower, again this goes back to what objects are included in the event space, human classification (and reclassification) and (re-)ranking optimization tasks executed at runtime, and reward distribution as it pertains to goal achievement. If the ultimate goal is discovered to be achievable, the penalty doesn't matter, because there is a clear and executable path towards goal completion. And that is the supreme reward state-- successful goal completion.

> Why was human feedback even part of AI training simulation? Why did the reward function in training include logic that says 'if the simulated comms tower is destroyed, do not penalize friendly fire'?

This could be a simple "human in the loop" requirement. Additionally, if AI has access to auditory input streams, it can decide what signals are important regardless of if speech is directed toward it.

As for the reward function, AI can decide the presumed penalty is not that severe, so "explore" and see what happens (e.g., do I accomplish my goal?). If the goal is accomplished, it doesn't matter the penalty, because goal completion is the desired end state.

I will ask you the same, why do humans engage in friendly fire? And why is friendly fire not penalized?

> We can talk about hypothetical AGI all we want, but that has nothing to do with what us currently called "AI"...

On the contrary, advanced AI/AGI will be able to recall. Why? It will have access to the data (e.g., the news articles, the classified and unclassified docs, the humans providing opinions about what happened, the input specifications and outcomes, the weights). Again, I will caution that AI is not human; it is not limited to be forever fallible, like humans.