| > “We show that, if model weights are released, safety fine-tuning does not effectively prevent model misuse. Consequently, we encourage Meta to reconsider their policy of publicly releasing their powerful models.” The technology in the paper is cool and the work is well done, but the conclusion “Meta should reconsider releasing model weights” does not follow. Meta already released the Llama 2 base model without the safety tuning. It’s on HuggingFace, and chat finetunes of it based on uncensored datasets are popular. So far, no additional safety impacts are clear to me beyond the same issues caused by the availability of OpenAI’s APIs. I expect the lack of major safety impacts to continue for two reasons. First, for non-existential risks that do not implicate runaway AI, such as “following harmful instructions” to create spam or explosives, a smaller on-prem model poses a much higher barrier to entry than a clever prompt-based jailbreak of GPT-4. Llama 2 could tell you how to build a biolab, but it would likely be wrong. And to get even that, you’d need to stand up your own hosting, get a dataset, LoRA the model, and then ask your evil question. Contrast that with copy/pasting the latest clever DAN jailbreak prompt into GPT-4. Second, for x-risk concerns, the on-prem models are fundamentally not frontier models, meaning models that push beyond the performance of GPT-4. By definition, open source hobbyists do not and will never have the resources to run or finetune frontier models. So any alignment work / x-risk testing can still take place prior to release of the model weights. I am as concerned about AI risk as anyone, but the focus on open source LLMs seems like a distraction from the real risks of large models that are already deployed, like adtech and recommender systems. |