Hacker News new | ask | show | jobs
by atleastoptimal 332 days ago
We can't rely on hoping that AI models never see bad ideas or are exposed to harmful content for them to be safe. That's a very flimsy alignment plan and is far more precarious than designing models which understand and are aware of bad content and nevertheless aren't affected in a negative direction.
1 comments

I think we need both approaches. I don't want to know some things. For example, people who know how good heroin feels can't escape the addiction. The knowledge itself is a hazard.
Still, any AI model vulnerable to cogitohazards is a huge risk because any model could trivially access the full corpus of human knowledge. It makes more sense to making sure the most powerful models are resistant to cogitohazards rather than developing elaborate schemes to shield their vision and hope that plan works out in perpetuity.