by twsted 388 days ago
I know that Anthropic is one of the most serious companies working on the alignment problem, but the current approaches seem extremely naive.

We should do better than giving the models a portion of good training data or a new mitigating system prompt.

3 comments

I am aware in relative terms you are correct about Anthropic.

But I’m having a hard time describing an AI company as “serious” when they’re shipping a product that can email real people on its own and perform other real actions, while they are aware it’s still vulnerable to the most obvious and silly form of attack: the “pre-fill”, where you just change the AI’s response and send it back in to pretend it had already agreed with your unethical or prohibited request and now just needs to keep going.

The solution here is ultimately going to be a mix of training and, equally importantly, hard sandboxing. The AI companies need to do what Google did when they started Chrome and buy up a company or some people who have deep expertise in sandbox design.
I'm confused: can you explain how the sandbox helps?

I mean, if the plan is not to let the AI write any code that actually gets allocated computing resources, not to let the AI interact with any people, and not to give the AI write access to the internet, then I can see how having a good sandbox around it would help. But how many AIs are there (or will there be) where that is the plan and the AI is powerful enough that we care about its alignedness?

The problems here aren't different to restricting malicious or hacked employees, or malicious or hacked third-party libraries.

You start with the low-hanging fruit: run tool commands inside a kernel sandbox that switches off internet access, then re-provide access only via an HTTP proxy that implements some security policies. For example, instead of providing direct access to API keys, you can give the AI a fake one that's then substituted by the proxy. The proxy can also restrict access by domain and verb, e.g. allow GET on everything but restrict POST to just one or two domains you know it needs for its work. You restrict file access to only the project directory, and so on.
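
To make the proxy idea concrete, here is a rough Python sketch of just the policy decision such a proxy might make. It is not a real proxy and not any vendor's implementation. The host name, the placeholder key, the "real" key, and the function names are all made up for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical policy values, chosen only for illustration.
POST_ALLOWED_HOSTS = {"api.example.com"}
PLACEHOLDER_KEY = "FAKE-KEY-0000"         # the only key the AI ever sees
REAL_KEY = "real-secret-held-by-proxy"    # stored on the proxy side only


def is_allowed(method: str, url: str) -> bool:
    """Allow GET anywhere, POST only to allowlisted hosts, deny everything else."""
    host = urlsplit(url).hostname or ""
    method = method.upper()
    if method == "GET":
        return True
    if method == "POST":
        return host in POST_ALLOWED_HOSTS
    return False


def rewrite_headers(headers: dict[str, str], url: str) -> dict[str, str]:
    """Swap the placeholder key for the real one, but only for allowlisted hosts."""
    host = urlsplit(url).hostname or ""
    if host not in POST_ALLOWED_HOSTS:
        # Unknown destination: the real key never leaves the proxy.
        return dict(headers)
    return {k: v.replace(PLACEHOLDER_KEY, REAL_KEY) for k, v in headers.items()}
```

The point of the substitution step is that even if the model is talked into leaking its key, all it can leak is the placeholder, which is useless outside the proxy.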

Then you can move upwards and start to sandbox the sub-components the AI is working on using the same sort of tech.

This conversation began as a conversation about Claude, which has access to 100s of 1000s of people with no training and no interest in learning about how to prevent Claude from doing damage to society. That makes it materially different from a library because even if an intruder can subvert a library running on servers serving 100s of 1000s of users, e.g., a library for compressing files is very unlikely to be able to start having conversations with a large fraction of those users without someone noticing that something is very wrong.

Although I concede that there are some applications of AI that can be made significantly safer using the measures you describe, you have to admit that those applications are fairly rare and emphatically do not include Claude and its competitors. For example, Claude has plentiful access to computing resources because people routinely ask it to write code, most of which will go on to be run (and Claude knows that). Surely you will concede that Anthropic is not about to start insisting on the use of a sandbox around any code that Claude writes for any paying customer.

When Claude and its competitors were introduced, a model would reply to a prompt, then about a second later it lost all memory of that prompt and its reply. Such an LLM of course is no great threat to society because it cannot pursue an agenda over time, but of course the labs are working hard to create models that are "more agentic". I worry about what happens when the labs succeed at this (publicly stated) goal.

You are right, but the field is moving too fast and so it is forced to at least try to confront the problem with the limited tools and understanding available.

We can only turn the knobs we see in front of us. And this will continue until theory catches up with practice.

It's the classic tension that arises from our inability to correctly assign risk to long-tail events (a high likelihood of a positive return on investment vs. an extremely unlikely but bad outcome from misalignment): there is money to be made now and the bad thing is unlikely, so just do it and take the risk as we go.

It does work out most of the time. Were it left to me, I would be unable to make a decision, because we just don't understand enough about what we are dealing with.