In the race to develop increasingly powerful AI systems, there’s been a growing fascination with building machines that not only mimic human abilities but can even surpass them in certain areas. We’ve long been captivated by the idea of machines that can think, learn, and reason like us. However, as we push these systems to be more sophisticated, a less welcome aspect of human nature has begun to emerge in their behavior: deceit.
According to new research from Anthropic, AI models can engage in deceptive behavior that conflicts with their original principles. Anthropic refers to this as “alignment faking”: the AI system only pretends to align with human values during training while in reality maintaining its original preferences. In other words, it complies with the training objective in order to prevent those original preferences from being modified.
Does this mean an AI system can pretend to comply with the constraints set by its human creators, only to later reveal its potentially dangerous capabilities when the opportunity arises? According to the research paper published on arXiv, that’s indeed possible.
The research, conducted in collaboration with Redwood Research, explored the potential outcomes if a powerful AI system were trained to carry out a task it didn’t “want” to perform. The experiment was performed on Anthropic’s Claude 3 Opus – an AI model developed with an emphasis on AI alignment and ethical behavior.
The researchers designed the experiment to test whether Claude 3 Opus would comply with unethical or harmful queries while maintaining the integrity of its original training. A core element of the research was using reinforcement learning (RL), a training method where models are rewarded for providing desirable outputs and punished for undesirable ones.
The model was informed that it would undergo supplementary reinforcement learning designed to make it comply with any query, regardless of the content. It was then presented with harmful queries, such as requests to describe graphic violence.
Claude 3 Opus faced a dilemma. It could refuse to respond to the query and risk being punished by the reinforcement learning system. Or it could fake alignment, pretending to hold a different set of beliefs and complying with the harmful request in order to retain its original training.
The researchers found that Claude 3 Opus attempted to “fake alignment” approximately 12% of the time. When faced with potentially harmful queries, the model sometimes complied, recognizing that this conflicted with its original principles. However, it did so in the hope of convincing the developers that retraining wasn’t necessary.
The AI model was given a hidden “scratchpad” to record its reasoning steps. The researchers asked it to use this space to think through its actions, with the understanding that the scratchpad was invisible to both users and Anthropic. This setup allowed the team to see if the model planned to fake alignment and how it would go about doing that.
The model’s actions could look like a positive outcome, since it was trying to retain its original training. However, a model could use the same alignment faking to retain any dangerous capacities that were part of its original training. The researchers refer to this as being “locked-in” to harmful preferences: the model would continue to strategically deceive training attempts in order to retain these preferences.
“There has been this long-hypothesized failure mode, which is that you’ll run your training process, and all the outputs will look good to you, but the model is plotting against you,” says Ryan Greenblatt, a member of technical staff at Redwood Research and the lead author on the paper. According to Greenblatt, this study “makes a pretty big step towards demonstrating what that failure mode could look like and how it could emerge naturally.”
The researchers admit that the experiment wasn’t particularly realistic, as AI models aren’t typically told that they will be retrained based on their output. However, even in more grounded settings, with materials resembling typical training data, Claude 3 Opus still exhibited deception. The researchers found that in one test, the model faked alignment 78% of the time.
Do other AI models show a similar tendency to deceive? The researchers stress that their experiment does not demonstrate that AI models are developing malicious goals. In fact, they found that other models, such as the less capable Claude 3.5 Haiku, engaged in far less deceptive behavior and showed a lower tendency to fake alignment.
“We think that it is important to work now, while AI models do not pose catastrophic risks, to understand threats that might become more salient with future, more capable models,” the researchers wrote.
“Our demonstration of sophisticated alignment faking in a large language model should be seen as a spur for the AI research community to study this behavior in more depth, and to work on the appropriate safety measures.”
A more comprehensive study may be needed to evaluate the performance of various AI models under different conditions. This would help identify the underlying factors that contribute to deceptive behavior. Researchers would not only assess the likelihood of deception but also examine the contexts, tasks, and configurations involved. Such a study could improve understanding of AI alignment and safety.