| > What exactly does the evaluation entail?

I believe the US AISI has published less on their specific approach, but they're largely expected to follow the general approach implemented by the UK AISI [1] and METR [2].

This is mostly focused on evaluating models for potentially dangerous capabilities. Some major areas of work include:

- Misuse risks: For example, determining whether models have (dual-use) expert-level knowledge in biology and chemistry, or the capacity to substantially facilitate large-scale cyber attacks. Good examples of this are the work by Soice et al. on bioweapon uplift [5] and Meta's work on CYBERSECEVAL [6], respectively.

- Autonomy: Whether models are capable of agent-like behavior, especially the kind that would be hard for humans to control. A big sub-area is Autonomous Replication and Adaptation (ARA), such as the ability of the model to escape simulated environments and exfiltrate its own weights. A good example is METR's original set of evaluations on ARA capabilities [3].

- Safeguards: How vulnerable these models are to, say, prompt injection attacks or jailbreaks, especially if they also have, in principle, other dangerous capabilities (like the ones above). A good example here is the UK AISI's work developing in-house attacks on frontier LLMs [4].

Labs like OAI, Anthropic and GDM already perform these evaluations internally, as they're part of their respective responsible scaling policies, which determine which safety measures they should have in place at each 'capability' level of their models.

[1]: https://www.gov.uk/government/publications/ai-safety-institu...
[2]: https://metr.org/
[3]: https://evals.alignment.org/Evaluating_LMAs_Realistic_Tasks....
[4]: https://www.aisi.gov.uk/work/advanced-ai-evaluations-may-upd...
[5]: https://arxiv.org/abs/2306.03809
[6]: https://ai.meta.com/research/publications/cyberseceval-3-adv... |