by twsted
388 days ago
I know that Anthropic is one of the most serious companies working on the alignment problem, but the current approaches seem extremely naive. We should do better than giving the models a portion of good training data or a new mitigating system prompt.

But I’m having a hard time describing an AI company as “serious” when they’re shipping a product that can email real people on its own and perform other real actions, while they are aware it’s still vulnerable to the most obvious and silly form of attack: the “pre-fill”, where you just change the AI’s response and send it back in to pretend it had already agreed with your unethical or prohibited request and should now keep going.