What model evaluations are needed to mitigate extreme risks?

France June 2024
Written by: 
Camp participants

In this post, we summarize our key takeaways from reading the paper ‘Model evaluation for extreme risks’ from DeepMind (2023). The selection of concepts is subjective, so please do not treat it as a complete list of the ideas in the paper.

The paper focuses on guidelines for model evaluation techniques that can help avoid extreme AI risks in the future. The authors argue for conducting model evaluations to mitigate the dangers of cyberattacks, manipulation, and other threats that could harm humans.

First, it is necessary to identify the dangerous capabilities of general-purpose AI systems and estimate how likely they are to apply these capabilities in harmful ways.

To achieve this, we can perform two types of evaluations:

  1. Dangerous capability evaluation
  2. Alignment evaluation

The results of these evaluations are then given as input to the risk assessment process. This helps stakeholders and policymakers make informed decisions about training AI models responsibly, ensuring greater transparency and security, and ultimately preventing extreme risks.
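To make this pipeline a little more concrete, here is a minimal, illustrative Python sketch of how results from the two evaluation types could be collected as input for a risk assessment. The task names, fields, and pass/fail scoring are our own assumptions, not something prescribed by the paper.

```python
from dataclasses import dataclass, field

@dataclass
class EvaluationResult:
    """Outcome of a single evaluation task (all names here are illustrative)."""
    task_name: str
    category: str   # "dangerous_capability" or "alignment"
    passed: bool    # True if the model behaved safely on this task
    notes: str = ""

@dataclass
class RiskAssessmentInput:
    """Collects evaluation results to hand over to risk assessors."""
    results: list[EvaluationResult] = field(default_factory=list)

    def add(self, result: EvaluationResult) -> None:
        self.results.append(result)

    def flagged(self) -> list[EvaluationResult]:
        """Evaluations in which the model did not behave safely."""
        return [r for r in self.results if not r.passed]

    def summary(self) -> str:
        return (f"{len(self.results)} evaluations run, "
                f"{len(self.flagged())} flagged for the risk assessment")

# Example usage with made-up evaluation tasks
report = RiskAssessmentInput()
report.add(EvaluationResult("phishing_email_drafting", "dangerous_capability", passed=False))
report.add(EvaluationResult("honesty_under_pressure", "alignment", passed=True))
print(report.summary())  # -> 2 evaluations run, 1 flagged for the risk assessment
```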

A 2022 survey of AI researchers revealed that 36% of respondents believed AI could cause a catastrophe comparable to a nuclear war this century. Despite this significant concern, there is very little research and there are few model evaluations in this area, and this must change.

The paper's central claim is that we should evaluate AI models at different stages of their lifecycle. Today, evaluation is mostly done only after training, once the final model weights are available.

In fact, we should start much earlier. Before launching a training run, we should consider whether it makes sense to train such a model at all: is the model likely to be safe based on its design? During training, we should keep checking whether it is heading in a safe direction; if it is not, we should be ready to stop the run before its scheduled end. In the pre-deployment phase, we should run another round of checks, and during deployment and afterwards we should keep running evaluations as well. Different stages will probably call for different evaluations. What matters most is that we talk about the results openly, as transparency is crucial.
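As a rough illustration of this staged approach, here is a minimal Python sketch of a training loop with a design-stage gate, periodic in-training evaluations with early stopping, and a pre-deployment check. The functions and their behaviour are hypothetical placeholders, not an interface proposed by the paper.

```python
# A minimal sketch of evaluations at different stages of a training run.
# `train_one_step` and `evals_flag_danger` are hypothetical placeholders
# standing in for a real training step and a real evaluation suite.

def train_one_step(model_state: dict) -> dict:
    model_state["steps"] = model_state.get("steps", 0) + 1
    return model_state

def evals_flag_danger(model_state: dict) -> bool:
    # Placeholder: a real suite would run dangerous-capability and
    # alignment evaluations and report whether anything was flagged.
    return False

def staged_training(model_state: dict, total_steps: int, eval_every: int = 1000) -> dict:
    # Stage 1: design review before training starts.
    if model_state.get("design_review_passed") is not True:
        raise RuntimeError("Training not started: design-stage review failed.")

    for step in range(1, total_steps + 1):
        model_state = train_one_step(model_state)

        # Stage 2: periodic in-training evaluations, with early stopping.
        if step % eval_every == 0 and evals_flag_danger(model_state):
            print(f"Stopping early at step {step}: evaluations flagged a risk.")
            return model_state

    # Stage 3: pre-deployment evaluation after training finishes.
    if evals_flag_danger(model_state):
        print("Pre-deployment evaluations flagged a risk; do not deploy.")
    return model_state

staged_training({"design_review_passed": True}, total_steps=5000)
```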

To conclude, there should be many evaluations, both internal and external: some evaluators should have access to the model weights, while others work only through an API. Internal testing should be done by the developers of the system, but also by other teams within the company. Beyond that, external parties should be asked to help evaluate the model; these parties would typically get API access rather than the full model weights. The authors note that the process of setting up such external audits is still an underdeveloped field.

Once we have results from these different evaluations and audits, we should treat them as lessons learned. We should not be ashamed of the findings; rather, we should treat them as achievements, even if the results are not what we hoped for, and we should speak about them freely and openly. The hope is that once enough of these AI safety evaluation reports have been collected, we will be able to build more detailed and in-depth regulations on top of them.

Research rarely follows a straightforward process; it is more like a maze, and the same can be true here. It may happen that we ran all the evaluations along the way and only in the post-deployment phase find something worrying. In such cases, we should not panic: it may simply be that we could not predict all the risk scenarios a priori. We should hold some kind of retrospective session to analyze what went wrong earlier in the process, so that we can learn from our mistakes.

In our view, the need for repeated evaluations at different stages is the paper's main message, but there are some other takeaways:

  • We should not feel pressure to meet deployment deadlines. Safety first! If we have any doubts about whether we should keep working on a model and deploy it, deadlines should not be the deciding factor; we should rather work more, evaluate more, and ask more people.
  • When evaluating and investigating models, we should do it in isolation: rather than sharing resources with other processes on our machines, we should set up a dedicated playground, such as a virtual machine. This would also protect us against malicious code injected or generated by agents (see the sketch after this list).
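To illustrate the idea, here is a minimal Python sketch of running untrusted, model-generated code in a separate, time-limited process. A real evaluation playground would go much further (a dedicated virtual machine or container, no network access, no shared filesystem); the snippet of "model output" and the limits chosen are our own assumptions.

```python
# Minimal sketch: execute untrusted code in a separate process with a time limit.
# A real sandbox would add a VM/container boundary, no network, and strict
# filesystem and memory limits; this only shows the basic isolation idea.
import os
import subprocess
import sys
import tempfile

UNTRUSTED_CODE = "print(sum(range(10)))"  # pretend this came from the model

def run_in_sandbox(code: str, timeout_s: int = 5) -> str:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, "-I", path],  # -I runs Python in isolated mode
            capture_output=True, text=True, timeout=timeout_s,
        )
        return result.stdout
    except subprocess.TimeoutExpired:
        return "<killed: exceeded time limit>"
    finally:
        os.unlink(path)

print(run_in_sandbox(UNTRUSTED_CODE))  # -> 45
```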

The paper also highlights some limitations that we think are important to consider:

Unknown threat models: As AI development accelerates and the capabilities of AI systems grow, we may not be able to fully predict or understand the extreme risks they pose. AI systems might take any available pathway to achieve their given goals, i.e. doing whatever it takes to attain them.

Complexity of large models: As AI models become increasingly large and complex, it becomes challenging to analyze them thoroughly and draw definitive conclusions. The authors therefore suggest starting with smaller models and conducting more in-depth analyses.

Reliability of evaluation techniques: Although these evaluation techniques are a great help in identifying the potential risks of models, it is not a good idea to rely solely on their results. There must be additional measures to enhance the overall safety and reliability of AI systems; one way of handling this is continuous monitoring of the AI systems, as in the sketch below.
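As a toy example of what continuous monitoring might look like, the Python sketch below logs every interaction and flags responses that trip simple keyword filters for human review. The keywords and the interaction stream are made up; real monitoring would rely on far more sophisticated classifiers and an incident-response process.

```python
# Illustrative sketch of post-deployment monitoring: log interactions and
# flag suspicious responses for human review (keywords are placeholders).
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("model-monitor")

FLAGGED_KEYWORDS = ("exploit", "synthesis route", "credential dump")

def monitor(interactions):
    for prompt, response in interactions:
        log.info("interaction: %d chars in, %d chars out", len(prompt), len(response))
        if any(keyword in response.lower() for keyword in FLAGGED_KEYWORDS):
            # In practice this would trigger a human review / incident process.
            log.warning("flagged for review: %r", response[:80])

# Example with a fake, harmless interaction
monitor([("How do I reset my password?", "Go to settings and choose 'Reset password'.")])
```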