[openai-blog] Deliberative alignment: reasoning enables safer language models
OpenAI published research on 20 December 2024 describing "deliberative alignment," a technique that uses chain-of-thought reasoning to improve safety in large language models [source]. The company reports that o1, its reasoning model, refused 93% of harmful requests in internal evaluations, compared with 76% for GPT-4o.
The approach allows models to reason through safety policies before responding. OpenAI states that o1 demonstrated improved performance on jailbreak robustness benchmarks and reduced stereotyping in outputs. The research tested models against adversarial prompts designed to elicit policy violations.
OpenAI acknowledges limitations. The company notes that deliberative alignment can increase response latency and that reasoning traces may expose model vulnerabilities if adversaries gain access to internal thought processes. The research also found that models sometimes "overthink" benign requests, refusing content that does not violate policy.
The announcement follows previous reports of inconsistent safety behaviour in reasoning models. OpenAI states it applied reinforcement learning to align o1's reasoning process with safety guidelines, rather than only filtering final outputs.
Independent testing of the claims was not available at publication. OpenAI did not release the evaluation datasets or specify which jailbreak techniques were tested. The company states it will continue monitoring model behaviour in production.
The research represents a shift from post-hoc content filtering toward embedding safety considerations in the model's reasoning chain. OpenAI indicates this method may scale more effectively as models become more capable, though the company notes ongoing challenges with false refusals and computational overhead.
Why this is an AI incident
Launch-archive bulk classification (10 May 2026). The source signal originates from a real AI provider, regulator, or model-comparison probe; the harm or behavioural change described would not have occurred without the AI system being deployed in the role described. An editor reviewing the archive may amend the rationale for each wire.
This classification applies the counterfactual "but-for" test per the Editor's Guide.