| HN Mirror

by maizeq 928 days ago

This is not about open source AI, and the people who are saying it is don’t seem to understand the point Anthropic are making here.

The point here is that malicious hidden behaviour encoded during pre-training seems to be very resistant to generic finetuning when you don’t know what the hidden behaviour is.

If random websites start including hidden or discreet bits of text containing malicious instructions, those instructions might be activated post-hoc to get a model to do something nefarious. This impacts open source and closed source models alike, since they all generally train on trillions of tokens which can’t be manually verified for hidden traps like this.