by agucova 659 days ago
> What exactly does the evaluation entail?

I believe the US AISI has published less on their specific approach, but they’re largely expected to follow the general approach implemented by the UK AISI [1] and METR [2].

This is mostly focused on evaluating models on potentially dangerous capabilities. Some major areas of work include:

- Misuse risks: For example, determining whether models have (dual-use) expert-level knowledge in biology and chemistry, or the capacity to substantially facilitate large-scale cyber attacks. Good examples are the work by Soice et al. on bioweapon uplift [5] and Meta's work on CYBERSECEVAL [6], respectively.

- Autonomy: Whether models are capable of agent-like behavior, like the kind that would be hard for humans to control. A big sub-area is Autonomous Replication and Adaptation (ARA), like the ability of the model to escape simulated environments and exfiltrate its own weights. A good example is METR's original set of evaluations on ARA capabilities [3].

- Safeguards: How vulnerable these models are to, say, prompt injection attacks or jailbreaks, especially if they also have, in principle, other dangerous capabilities (like the ones above). A good example here is the UK AISI's work developing in-house attacks on frontier LLMs [4].

Labs like OAI, Anthropic and GDM already perform these internally as they're part of their respective responsible scaling policies, which determine which safety measures they should have implemented for every given 'capability' level of their models.

[1]: https://www.gov.uk/government/publications/ai-safety-institu...
[2]: https://metr.org/
[3]: https://evals.alignment.org/Evaluating_LMAs_Realistic_Tasks....
[4]: https://www.aisi.gov.uk/work/advanced-ai-evaluations-may-upd...
[5]: https://arxiv.org/abs/2306.03809
[6]: https://ai.meta.com/research/publications/cyberseceval-3-adv...


None of those are actual risks. The AISI is just a bunch of worthless grifters wasting taxpayer money.
> None of those are actual risks.

In 1995, Aum Shinrikyo carried out attacks on Japanese subways using sarin gas, which they had produced themselves. They killed over a dozen people and temporarily blinded around a thousand.

You seem to be claiming that the only reason we haven't seen similar attacks from the thousands of worldwide doomsday cults and terrorist groups over the last three decades is that they don't want to. I disagree. I think that if step-by-step, adaptable directions for creating CBRN weapons were widely accessible, we would see many more such attacks, and many more deaths.

Current SOTA models do not seem to have this capability. However, it is entirely plausible that future models will exceed the capabilities of a bunch of long-haired cultists, in the mountains, in 1995. This is not a fake risk.

> is that they don't want to

Yes, that is essentially the reason. It's not hard to know enough chemistry to figure out how to make these things. The fact that such attacks (your example is small-scale and very ineffective, let's not forget) don't happen more often is due to the general incompetence of human beings and the relatively tight controls on the basic components (which aren't particularly challenging to monitor for). The tests described are theater, based on the idea that knowledge itself is dangerous.

This way of testing is a regressive stance that essentially presupposes that our adversaries are dumb babies that can't figure anything out on their own. If that were the case, they would also be too stupid to figure out the correct things to ask to get a real set of instructions. Given those things, it's theater.

Theater wastes everyone's time so that people who cannot or don't want to evaluate the actual risks involved can feel reassured. This is something we shouldn't make a habit of doing. It's not worth wasting the time of people with good ability to assuage the worries of people with little ability in a way that has no effect on actual risk. Instead of this, we should address real risk (which we're already doing) and educate other people so they can understand that these are the correct steps to take.

So, your argument is that groups like ISIS and Hamas

- Don't really want to hurt a lot of people that way
- Couldn't access any dangerous ingredients, even if they had the know-how
- Are too dumb to build these things

I agree with reason #3. That is why I don't want to give out open-source models which are world-class experts in chemistry, biology, logistics, operations, and tutoring dumb people.

I disagree with your belief that motivated people with a next-generation generative model doing their planning could not source dangerous ingredients. I'm not going to say much about CBRN in particular, but e.g. ANFO bombs are prevented by monitoring fertilizer sales; nobody tries to monitor natural gas sales or make sure some compound out in the hills isn't setting up their own Haber-Bosch process.

I am also opposed to security theater. Run the numbers on TSA, and it's easy to see that it's a net negative even if it cost 0 tax dollars. But not all government-led safety efforts are theater; seatbelt laws saved a lot of lives, indoor smoking bans saved a lot of lives, OSHA saved a lot of lives.

We know there are folks out there who want to kill a lot of people. We know their capabilities range from "grabbing the nearest hard or pointy object and swinging it" to "medium-scale CBRN attacks." Pushing each of these kinds of people one or two rungs up the capabilities ladder is a real danger; nothing imaginary about it.

4 sounds like a nonsense catch-all for "says things the government doesn't like".
I imagine you meant societal harms? I think this was mostly my fault. I edited the areas of work a bit to better reflect what the UK AISI is actually working on right now.