| HN Mirror


by foo3a9c4 883 days ago

> The first links are spiffy little metaphors, but apply just as much at "God could smite all of humanity, even if you don't understand how". They're not making any argument, just assumptions. In particular, they accidentally show how an AI can be superhumanly capable at certain tasks (chess), but be easily defeated by humans at others (anything else, in the case of Stockfish).

As I understand it, Yud is actually providing a counterexample to a premise that other people are using to argue that humans will probably not be disempowered by AI systems. The relevant argument looks like this:

  P1: If intelligent system A cannot give a detailed account of how it would be bested by a more intelligent system B, then A will not be bested by B.
  P2: Humans (so far) cannot give a detailed account of how a more intelligent AI system would best them.
  C: So, humans will not be bested by a more intelligent AI system.

Yud is using the unskilled chess player and Magnus as a counterexample to P1.

> The argument starts with a hypothetical ("there is a possible artificial agent"), and it fails to be scary: there are (apparently) already humans that can kill 70% of humanity, and yet most of humanity is still alive. So an AGI that could also do it is not implicitly scarier.

Right, it's only an argument for the possibility of AGI catastrophe. It doesn't make any move to convince you that the scenario is likely. And it sounds like you already accept that the scenario is possible, so shrug.

> The final twitter thread is basically a thread of people saying "no, there is no canonical, well-formulated argument for AGI catastrophe", so I'm not sure why you shared it.

Maybe there is no canonical argument, but the thread definitely features arguments for likely AI catastrophe:

  https://wiki.aiimpacts.org/doku.php?id=arguments_for_ai_risk:is_ai_an_existential_threat_to_humanity:will_malign_ai_agents_control_the_future:argument_for_ai_x-risk_from_competent_malign_agents:start
  https://arxiv.org/abs/2206.13353
  https://aiadventures.net/summaries/agi-ruin-list-of-lethalities.html

2 comments

reissbaker 883 days ago

Of the three links you posted:

1. States things like "Finding goals that are extinction-level bad and relatively useful appears to be easy: for example, advanced AI with the sole objective ‘increase company.com revenue’ might be highly valuable to company.com for a time, but risks longer term harms to society, if powerfully accruing resources and power toward this end with no regard for ethics beyond laws that are still too expensive to break." But even current-gen LLMs sidestep this pretty easily, and if you ask them to increase e.g. revenue, they do not propose extinction-level events or propose eschewing basic ethics. This argument falls apart upon contact with reality.

2. Is a 57-page PDF of subjectively-defined risks where it gives up on generalized paperclip-maximizing as a threat, but instead proposes narrower "power-seeking" as an unaligned threat that will lead to doom. It presents little evidence that language models will likely attempt to become power-seeking in the real world other than a (non-language-model) reinforcement learning experiment conducted by OpenAI in which an AI was trained to be good at a game that required controlling blocks, and the AI then attempted to control the blocks. It is possible I missed something in the 57 pages, but once it defines power-seeking as a supposed likely existential risk, it seemed to jump straight into proposals on attempted mitigations.

3. Requires accepting that we will by default build a misaligned superhuman AI that will cause humanity to go extinct as the basic premises of the argument (P1-P3), which makes the conclusions not particularly convincing if you don't already believe that.

foo3a9c4 882 days ago

> 1. States things like "Finding goals that are extinction-level bad and relatively useful appears to be easy: for example, advanced AI with the sole objective ‘increase company.com revenue’ might be highly valuable to company.com for a time, but risks longer term harms to society, if powerfully accruing resources and power toward this end with no regard for ethics beyond laws that are still too expensive to break." But even current-gen LLMs sidestep this pretty easily, and if you ask them to increase e.g. revenue, they do not propose extinction-level events or propose eschewing basic ethics. This argument falls apart upon contact with reality.

Are you claiming that (A) nice behavior in current LLMs is good evidence that all future AI systems will behave nicely, or (B) nice behavior in current LLMs is good evidence that future LLMs will behave nicely?

> 3. Requires accepting that we will by default build a misaligned superhuman AI that will cause humanity to go extinct as the basic premises of the argument (P1-P3), which makes the conclusions not particularly convincing if you don't already believe that.

P3 from the argument says, "Superhuman AGI will be misaligned by default". I interpret that as meaning: if there isn't a highly resourced and focused effort to align superhuman AGI systems in advance of their creation, then the first systems we build will be misaligned.

Is that the same way you are interpreting it? If so, why do you believe it is probably false?

reissbaker 882 days ago

1. I am saying that the claim "it is easy to find goals that are extinction-level bad" with regards to the AI tech that we can see today is incorrect. LLMs can understand context, and seem to generally understand that when you give them a goal of e.g. "increase revenue," that also includes various sub-goals like "don't kill everyone" that are implicit and don't need stating. Scaling LLMs to be smarter, to me, does not seem like it would reduce their ability to implicitly understand sub-goals like that.

3. P1-P3 are non-obvious and overly speculative to me in many ways. P1 states that current research is likely to produce superhuman AI; I think that is controversial amongst researchers as it is: LLMs may not get us there.

P2 states that "superhuman" AI will be uncontrollable; once again, I do not think that is obvious, and depends on your definition of superhuman. Does "superhuman" mean dramatically better at every mental task, e.g. a human compared to a slug? Does it mean "average at most tasks, but much better at a few?" Well, then it depends what few tasks it's better at.

Similarly, it anthropomorphizes these systems and assumes they want to "escape" or not be controlled; it is not obvious that a superhumanly-intelligent system will "want" anything; Stockfish is superhuman at chess, but does not "want" to escape or do anything at all: it simply analyzes and predicts the best next chess move. The idea of "desire" on the part of the programs is a large unstated assumption that I think does not necessarily hold.

Finally, P3 asserts that AI will be "misaligned by default" and that "misaligned" means that it will produce extinction or extinction-level results, which to me feels like a very large assumption. How much misalignment is required for extinction? Yud has previously made very off-base claims on this, e.g. believing that instruction-following would mean that an AI would kill your grandmother when tasked with getting a strawberry (if your grandmother had a strawberry), whereas current tech can already implicitly understand your various unstated goals in strawberry-fetching like "don't kill grandma." The idea that any degree of "misalignment" will be so destructive that it would cause extinction-level events is a) a stretch to me, and b) not supported by the evidence we have today.

In fact a pretty simple thought experiment in the converse is: a superhumanly-intelligent system that is misaligned on many important values, but is aligned on creating AI that aligns with human values, might help produce more-intelligent and better-aligned systems that would filter out the misaligned goals, so even a fair degree of misalignment doesn't seem obviously extinction-creating.

Furthermore, it is not obvious that we will produce misaligned AI by default. If we're training AI by giving it large corpuses of human text (or images, etc), and evaluating success by the model producing human-like output that matches the corpus, that... is already a form of an alignment process: how well does the model align to human thought and values in the training corpus?

Anthropomorphizing an evil model that "wants" to exist and will thus "lie" to escape the training process but will secretly not produce aligned output at some hidden point in the future is... once again a stretch to me, especially because there isn't an obvious evolutionary process to get there: there has to already exist a superhuman, desire-ful AI that can outsmart researchers long before we are capable of creating superhuman AI, because otherwise the dumb-but-evil AI would give itself away during training and its weights wouldn't survive getting culled by poor model performance.

P1-P3 are just so speculative and ungrounded in the reality we have today that it's very hard for me to take them seriously.

foo3a9c4 881 days ago

> 1. I am saying that the claim "it is easy to find goals that are extinction-level bad" with regards to the AI tech that we can see today is incorrect. LLMs can understand context, and seem to generally understand that when you give them a goal of e.g. "increase revenue," that also includes various sub-goals like "don't kill everyone" that are implicit and don't need stating. Scaling LLMs to be smarter, to me, does not seem like it would reduce their ability to implicitly understand sub-goals like that.

I agree with both of these claims: (A) it is hard to find goals that are extinction-level bad for current SOTA LLMs, and (B) current SOTA LLMs understand at least some important context around the requests made to them.

But I'm also skeptical that they understand _all_ of the important context around requests made to them. Do you believe that they understand _all_ of the important context? If so, why?

> P2 states that "superhuman" AI will be uncontrollable — once again, I do not think that is obvious, and depends on your definition of superhuman. Does "superhuman" mean dramatically better at every mental task, e.g. a human compared to a slug? Does it mean "average at most tasks, but much better at a few?" Well, then it depends what few tasks it's better at.

I take "superhuman" to mean dramatically better than humans at every mental task.

> Similarly, it anthropomorphizes these systems and assumes they want to "escape" or not be controlled; it is not obvious that a superhumanly-intelligent system will "want" anything; Stockfish is superhuman at chess, but does not "want" to escape or do anything at all: it simply analyzes and predicts the best next chess move. The idea of "desire" on the part of the programs is a large unstated assumption that I think does not necessarily hold.

Would you have less of a problem with this premise if instead it talked about "Superhuman AI agents"? I agree that some systems seem more like oracles rather than agents, that is, they just answer questions rather than pursuing goals in the world.

Consider self-driving cars: regardless of whether or not they 'really want' to avoid hitting pedestrians, they do in fact avoid hitting pedestrians. P2 is roughly making the analogous assertion: regardless of whether or not a superhuman AI agent 'really wants' to escape control by humans, it will in fact not be controllable by humans.

> Finally, P3 asserts that AI will be "misaligned by default" and that "misaligned" means that it will produce extinction or extinction-level results, which to me feels like a very large assumption. How much misalignment is required for extinction? Yud has previously made very off-base claims on this, e.g. believing that instruction-following would mean that an AI would kill your grandmother when tasked with getting a strawberry (if your grandmother had a strawberry), whereas current tech can already implicitly understand your various unstated goals in strawberry-fetching like "don't kill grandma." The idea that any degree of "misalignment" will be so destructive that it would cause extinction-level events is a) a stretch to me, and b) not supported by the evidence we have today.

I'm often unsure whether you are making claims about all future AI systems or just future LLMs.

> In fact a pretty simple thought experiment in the converse is: a superhumanly-intelligent system that is misaligned on many important values, but is aligned on creating AI that aligns with human values, might help produce more-intelligent and better-aligned systems that would filter out the misaligned goals — so even a fair degree of misalignment doesn't seem obviously extinction-creating.

Maybe. Or the misaligned system will just disinterestedly and indirectly kill everyone by repurposing the Earth's surface into a giant lab and factory for making the aligned AI.

> Furthermore, it is not obvious that we will produce misaligned AI by default. If we're training AI by giving it large corpuses of human text (or images, etc), and evaluating success by the model producing human-like output that matches the corpus, that... is already a form of an alignment process: how well does the model align to human thought and values in the training corpus?

I believe it is likely that this process does some small amount of alignment work. But I would still expect the system to be mostly confused about what humans want.

Is this roughly the argument that you are making?

  (P1) Current SOTA LLMs are good at understanding implicit context.
  (P2) A system must be extremely misaligned in order to cause a catastrophe.
  (C) So, it will be easy to sufficiently align future more powerful LLMs.

reissbaker 881 days ago

My arguments are:

(P1) Current SOTA AI is good at understanding implicit context, and improved versions will likely be better at understanding implicit context (much like gpt-4 is better at understanding context than gpt-3, and llama2 is better than llama1, and mixtral is better than gpt-3 and better than claude, etc).

(P2) Most misalignments within the observable behavior of current AI do not produce extinction-level goals, and given (P1), it is unclear why someone would believe it's likely going to in the future, since they'll be even better at understanding implicit human context of goals (e.g. implicit goals like do not make humanity extinct, don't turn the entire surface of the planet into an AI lab, etc).

I think there are several other arguments, though, e.g.:

(P1) Progress on AI capabilities is evolutionary, with dumber models slowly being replaced by derivative-but-better models, in terms of architectural evolutionary improvements (e.g. new attention variants), dataset evolutionary improvements as they grow larger and as finetuning sets grow higher quality, and in terms of benchmark and alignment evolutionary progress.

(P2) Evolutionary steps towards evil-AI will likely be filtered out during training, since it will not yet be generalized superhuman intelligence and will give away its misalignment during training, whereas legitimately-aligned AI model evolutions will be rewarded for better performance.

(P3) Generalized superhuman intelligence will likely be an evolutionary step from a well-aligned ordinary intelligence, which will be an evolutionary step from sub-human intelligence that is reasonably well aligned.

Or:

(P1) LLMs have architectural issues that will prevent them from quickly becoming generalized superintelligence of the "human vs slug" variety (bad/inefficient at math, tokenization issues, likelihood of hallucinations, limited ability to learn new facts without expensive and slow training runs, difficulty backtracking from incorrect chains of reasoning, etc).

(C) LLM research is not likely to soon produce a superhuman AI able to cause an extinction event for humanity, and should not be illegal.

However, ultimately my most strongly-believed personal argument is:

(P1) The burden of proof for making something illegal due to apocalyptic predictions lies on the prognosticator.

(P2) There is not much hard evidence of an impending apocalypse due to LLMs, and philosophical arguments for it are either self-referential and require belief in the apocalypse as a prerequisite, or are highly speculative, or both.

foo3a9c4 879 days ago

(I don't currently have the energy to engage with each argument, so I'm just responding to the first.)

> (P1) Current SOTA AI is good at understanding implicit context, and improved versions will likely be better at understanding implicit context (much like gpt-4 is better at understanding context than gpt-3, and llama2 is better than llama1, and mixtral is better than gpt-3 and better than claude, etc).

I believe that (P1) is probably true.

> (P2) Most misalignments within the observable behavior of current AI do not produce extinction-level goals, and given (P1), it is unclear why someone would believe it's likely going to in the future, since they'll be even better at understanding implicit human context of goals (e.g. implicit goals like do not make humanity extinct, don't turn the entire surface of the planet into an AI lab, etc).

I'm confused about what exactly you mean by "goals" in (P2). Are you referring to (I) the loss function used by the algorithm that trained GPT-4, (II) goals and sub-goals which are internal parts of the GPT-4 model, or (III) the sub-goals that GPT-4 writes into a response when a user asks it "What is the best way to do X?"

simiones 882 days ago

> P1: If intelligent system A cannot give a detailed account of how it would be bested by a more intelligent system B, then A will not be bested by B. P2: Humans (so far) cannot give a detailed account of how a more intelligent AI system would best them. C: So, humans will not be bested by a more intelligent AI system.

I don't think anyone seriously believes this. It's very very clear to all humans that have ever played a game of any kind that they can be defeated in unexpected ways. I don't even think that anyone believes the claim "it's impossible for AGI to pose an existential risk to humanity".

The negation of the claim "AGI poses an existential risk to humanity" is "AGI doesn't necessarily pose an existential risk to humanity". This is what most people in the world believe, and it is the obvious "null theory" about any technology.

> https://wiki.aiimpacts.org/doku.php?id=arguments_for_ai_risk...

The argument here works just as much for single-minded humans, so it's quite moot.

> https://arxiv.org/abs/2206.13353

Too long, sorry. Maybe I will read it someday, but not today.

> https://aiadventures.net/summaries/agi-ruin-list-of-lethalit...

This seems to agree with my previously stated positions. It does try to establish a canonical argument, as you say, but then it goes on to explain why they don't think it's persuasive.

foo3a9c4 882 days ago

> I don't think anyone seriously believes this. It's very very clear to all humans that have ever played a game of any kind that they can be defeated in unexpected ways. I don't even think that anyone believes the claim "it's impossible for AGI to pose an existential risk to humanity".

Okay. So we agree that (A) powerful systems can best weaker systems in ways that are unexpected to the weaker system, and (B) it is possible that AGI poses an existential risk to humanity.

> The negation of the claim "AGI poses an existential risk to humanity" is "AGI doesn't necessarily pose an existential risk to humanity".

It seems to me that the negation of your first claim is just "AGI doesn't pose an existential risk to humanity". Is "necessarily" doing some important work in your second claim?

>> https://wiki.aiimpacts.org/doku.php?id=arguments_for_ai_risk...

> The argument here works just as much for single-minded humans, so it's quite moot.

I don't understand why the argument being applicable to humans would make it moot. Please explain.

>> https://aiadventures.net/summaries/agi-ruin-list-of-lethalit...

> This seems to agree with my previously stated positions. It does try to establish a canonical argument, as you say, but then it goes on to explain why they don't think it's persuasive.

Is there a particular premise or inferential step in the blog's argument that you believe to be mistaken? (I've copied the argument below.)

  P1: The current trajectory of AI research will lead to superhuman AGI.
  P2: Superhuman AGI will be capable of escaping any human efforts to control it.
  P3: Superhuman AGI will be misaligned by default, i.e. it will likely adopt values and/or set long-term goals that will lead to extinction-level outcomes, meaning outcomes that are as bad as human extinction.
  P4: We do not know how to align superhuman AGI, i.e. reliably imbue it with values or define long-term goals that will ensure it does not ultimately lead to an extinction-level outcome, without some amount of trial & error (how nearly all of scientific research works).
  
  C1: P2 + P3 In the case of superhuman AGI, since it will be able to escape human control and misaligned by default, the only survivable path to alignment cannot involve trial & error because the first failed try will result in an extinction-level outcome.
  C2: P4 + C1 This means we will not survive superhuman AGI, because our survival would require alignment, towards which we have no survivable path: the only path we know of involves trial & error, which is not survivable.
  C3: P1 + C2 Therefore the current trajectory of AI research which will produce superhuman AGI leads to an outcome where we do not survive.
