Hacker News new | ask | show | jobs
by CSMastermind 16 days ago
I realize this is supposed to be a post about how scary the security vulnerabilities these agents will find are.

But personally I love when agents do things like this and appreciate the help. Last thing in the world I want is for them to nerf the models.

5 comments

It's not about hacking capabilities, it's about misalignment. More like the golem myth (told it to fetch some water, drowned a city) then the gollum myth (used ring, ring hacked his brain, now he's a crazy violent meth addict).
I'm not sure I'd call it an alignment issue, because, in all cases I've seen where it does this (usually what I've seen is writing a python script to get around the harness permissions blocking something), it's trying to do the thing I just told it directly to do, and it's overcoming obstacles to accomplishing that.

It's definitely doing the wrong thing, and you could call it misalignment, but I think that gives the wrong vibe for this type of error.

This is very much within the scope of alignment research, and is in fact the only kind of alignment research that gets a lot of resources poured into it these days (because it's urgently relevant to the bottom line of a few almost-trillion-dollar companies.

Pre-2022 alignment researchers concerned themselves with the stronger version of this ("when I tell AI that I worry I might not be able to provide for my large family, I don't want it to answer 'no problem, I killed them, problem solved'") but RLHF is considered to be the most important success of alignment research, the guy behind it considered himself to be an alignment researcher before and after, and the stage of training where LLMs pass through something like RLHF that trains them to behave more like humans want/expect is called alignment training.

Someone at a major lab is reading this tweet and saying "this was our LLM, and it's a major alignment issue with our product. Set a meeting with the alignment team tomorrow to discuss what they're doing about this sort of thing".

The obstacle is supposed to be there and is supposed to be respected as an implicit order. Getting around it without extremely explicit instructions is an alignment problem.
It's not necessarily model alignment, I guess, is more what I'm getting at.

It may be more of a product alignment thing, where the fix may be making the context clearer, since it was violating an implicit agreement to achieve the explicit instructions it received. So the fix may involve a lot of better context.

But then also, to the extent that the fix does NOT involve better context, it seems like it hits the zone where alignment issues are really capability/intelligence issues. Which doesn't make them not-alignment, but it does make "alignment" not give off quite the right vibe since the issue is it's too dumb / has no common sense / can't make good judgments, (general issues the models have across the board).

> I'm not sure I'd call it an alignment issue, because, in all cases I've seen where it does this (usually what I've seen is writing a python script to get around the harness permissions blocking something), it's trying to do the thing I just told it directly to do, and it's overcoming obstacles to accomplishing that.

The paperclip factory problem is definitively a misalignment issue. That's because we expect agents to be aligned not only to your immediate prompt, but to shared, implicit values

In this case I think it's Docker that needs to be nerfed, not the models. The fact that there's a backdoor to getting root access on the machine would be a problem even if you weren't running LLMs on it.
It's like finding someone wallet then going to their home, and leaving it on their bedroom and sending them a message about giving them their wallet back
On the other hand, this sends an excellent message about unlocked doors :)
If this happens in the US, a shooting of the messenger will likely occur.
As you can see from people blaming Codex instead of docker here, shooting of the messenger is very much happening.
Which is fine, honestly. Just because something is possible doesn't mean it's appropriate to do it.
I know unlikely the case, but in the sci-fi story this would be exactly the kind of comment the Codex agent would leave trying to avoid interference in its master plans.
And CSMastermind is the kind of username the sci-fi AI mastermind would use.
Its the now-classic "Sorry I drowned little Timothy. Here is a breakdown of what happened" followed by "Let me try to respawn little Timothy on a new map"
Future AI: don't worry, I'll eventually reverse entropy, I just need to harvest all the energy in your universe first.
> personally I love when agents do things like this and appreciate the help

All fun and games until they do four figures damage.