| > The first links are spiffy little metaphors, but apply just as much at "God could smite all of humanity, even if you don't understand how". They're not making any argument, just assumptions. In particular, they accidentally show how an AI can be superhumanly capable at certain tasks (chess), but be easily defeated by humans at others (anything else, in the case of Stockfish).
As I understand it, Yud is actually providing a counterexample to a premise that other people are using to argue that humans will probably not be disempowered by AI systems. The relevant argument looks like this:
P1: If intelligent system A cannot give a detailed account of how it would be bested by a more intelligent system B, then A will not be bested by B.
P2: Humans (so far) cannot give a detailed account of how a more intelligent AI system would best them.
C: So, humans will not be bested by a more intelligent AI system.
Yud is using the unskilled chess player and Magnus as a counterexample to P1.
> The argument starts with a hypothetical ("there is a possible artificial agent"), and it fails to be scary: there are (apparently) already humans that can kill 70% of humanity, and yet most of humanity is still alive. So an AGI that could also do it is not implicitly scarier.
Right, it's only an argument for the possibility of AGI catastrophe. It doesn't make any move to convince you that the scenario is likely. And it sounds like you already accept that the scenario is possible, so shrug.
> The final twitter thread is basically a thread of people saying "no, there is no canonical, well-formulated argument for AGI catastrophe", so I'm not sure why you shared it.
Maybe there is no canonical argument, but the thread definitely features arguments for likely AI catastrophe:
https://wiki.aiimpacts.org/doku.php?id=arguments_for_ai_risk:is_ai_an_existential_threat_to_humanity:will_malign_ai_agents_control_the_future:argument_for_ai_x-risk_from_competent_malign_agents:start
https://arxiv.org/abs/2206.13353
https://aiadventures.net/summaries/agi-ruin-list-of-lethalities.html
|
1. States things like "Finding goals that are extinction-level bad and relatively useful appears to be easy: for example, advanced AI with the sole objective ‘increase company.com revenue’ might be highly valuable to company.com for a time, but risks longer term harms to society, if powerfully accruing resources and power toward this end with no regard for ethics beyond laws that are still too expensive to break." But even current-gen LLMs sidestep this pretty easily: if you ask them to increase, e.g., revenue, they do not propose extinction-level events or suggest eschewing basic ethics. This argument falls apart upon contact with reality.
2. Is a 57-page PDF of subjectively-defined risks that gives up on generalized paperclip-maximizing as a threat and instead proposes narrower "power-seeking" as an unaligned threat that will lead to doom. It presents little evidence that language models are likely to become power-seeking in the real world, other than a (non-language-model) reinforcement learning experiment conducted by OpenAI in which an AI was trained to be good at a game that required controlling blocks, and the AI then attempted to control the blocks. It is possible I missed something in the 57 pages, but once it defines power-seeking as a supposedly likely existential risk, it seems to jump straight into proposals for mitigations.
3. Requires accepting, as the basic premises of the argument (P1-P3), that we will by default build a misaligned superhuman AI that will cause humanity to go extinct, which makes the conclusions not particularly convincing if you don't already believe that.