| HN Mirror


by seeknotfind 588 days ago

There are a few examples of alignment groups using repeated testing, either to measure how aligned a model is or to aggregate results into something more aligned. For instance, here is one related discussion: https://artium.ai/insights/taming-the-unpredictable-how-cont...

The non-determinism is a feature, and it can be disabled. The article linked above also mentions doing that to make alignment tests more deterministic.

Theoretically, if you aggregate enough results, it might become improbable to ever see an unaligned output. From a practical standpoint, though, we clearly prefer a much smarter model over running dumber models in parallel to get alignment that way. It's inefficient. The other thing is that, given the number of possible ways to jailbreak a model, you can probably find something that would still bypass ensemble-based protections.
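To put rough numbers on the "aggregate enough results" idea, here is a minimal sketch. It assumes each sample is unaligned independently with the same probability p, and that the ensemble takes a strict majority vote. Both are assumptions of mine, not something from the linked article, and the function name is made up for illustration. A jailbreak prompt tends to make failures correlated, which is exactly where this model stops applying.

```python
from math import comb


def majority_unaligned_prob(p: float, n: int) -> float:
    """Probability that a strict majority of n samples is unaligned.

    Assumes (hypothetically) that samples are independent and each one is
    unaligned with probability p. This is a binomial tail sum.
    """
    threshold = n // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(threshold, n + 1)
    )


if __name__ == "__main__":
    # With p below 0.5 the failure probability shrinks as n grows,
    # but every extra sample is another full model call.
    for n in (1, 5, 25, 101):
        print(n, majority_unaligned_prob(0.1, n))
```

If p is at or above 0.5 for a given prompt, voting no longer helps in this model, so a jailbreak only has to push the per-sample failure rate past that point.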

One other concept is relativism: there is a large grey area here. What is okay for one person is not okay for another, so getting consensus among people on what is okay is just not going to happen.