Hacker News
by jasfi 985 days ago
With properly trained LLMs and filters to catch unwanted behavior, this can be mitigated.

Even without all that, the agent would need mechanisms to protect itself that would also cause harm.

The scenario you suggest is so unlikely with all the protections that would be in place, that you would actually need someone with the goal of making LLMs behave maliciously for it to succeed at all. At the end of the day, it comes back to people and their goals.

5 comments

How can you ever be sure that you trained your LLM not to do harm, rather than to pretend not to do harm when it's tested? Something like VW's diesel engines but more sinister.

I feel like unless we gain the ability to debug each node the way we do with actual software, we won't be able to solve the alignment problem. I saw on HN that Anthropic is working on it, but I'm not knowledgeable enough on the subject to comment on whether it's actually feasible.

Probably the best case scenario for humanity is that LLMs plateau somehow and don't get much better for quite some time.

There's no need to actively try to make the AI malicious. That's the default for any AI that's more operationally capable than humans and has some difficult goal. Humans can only hinder it, so the goal is better accomplished with the humans removed.
Which protections? There are no protections currently, and yet you are imagining there could be effective ones?

We have no capacity to allow machines to judge malicious, moral or ethical behavior within the context of an LLM. So I'm not sure how we could implement them.

To implement anything remotely Asimovian, we would need to have AI that can reason and reflect deeply about its potential behaviors and likely subsequent consequences.

This seems very far off still...

OpenAI has done this with their LLMs, most serious players have.

See: https://cdn.openai.com/papers/gpt-4-system-card.pdf

They cover the safety/ethics built into GPT-4.

They’re making a token effort, but this kind of thing doesn’t extend to something more intelligent that can cause real harm. If you scaled GPT-4 up to something much more intelligent, it would probably at best just try to please us with ethical-sounding responses that aren’t necessarily actually good decisions. I remember seeing something where it said that saying an offensive word that no one will hear isn’t acceptable even if it’s the only way to save millions of people.
I wouldn't call it a token effort; they went to quite a bit of trouble to make GPT-4 safe. This is an active area of research too. At some point you need to prove GPT-4 would do something unsafe. If anyone did, OpenAI would improve their systems in response.
Filters to catch unwanted behavior? Yeah, good luck with that. If you have an actual AI, it will decide for itself what to filter. You may give it the initial set, but an actual AI won't necessarily stay there. (Just as many children rebel against their parents' "programming".)

You might be able to do that with an LLM. You won't with a real AI.

What kind of protections? As far as I know no one has come up with a good solution to that yet. It’s a whole field of research: https://en.m.wikipedia.org/wiki/AI_alignment

Your attitude reminds me of https://xkcd.com/793/

Ironic comment of the year.
I understand that I’m not an expert in this, but there are people working on it who are. I guess the linked XKCD is a bit ironic, since “modelling as a simple object” is similar to modelling a superintelligence in a simpler way. But that’s the only way you really can do it if it’s more intelligent than us: we can’t go through all the specific things it would do, because we wouldn’t think of them.