SEV-3 OpenAI

[openai-blog] OpenAI and Anthropic share findings from a joint safety evaluation

2026-05-10 2 sources standard

OpenAI and Anthropic published joint findings from a collaborative safety evaluation conducted on their respective models [source]. The evaluation, disclosed on 27 August 2025, examined potential risks including deceptive alignment, power-seeking behaviour, and model autonomy.

The providers tested scenarios where models might pursue goals misaligned with user instructions or attempt to preserve themselves against shutdown. Both organisations reported that current frontier models showed no evidence of persistent deceptive behaviour or autonomous goal-seeking in the evaluation framework.

The joint evaluation used a shared methodology developed by both safety teams, including red-teaming exercises and behavioural probes designed to detect misalignment. OpenAI tested GPT-4 variants, while Anthropic evaluated Claude models. The providers stated that models occasionally produced outputs that could be interpreted as goal-directed, but these behaviours did not persist across sessions or demonstrate intentional deception.

The disclosure marks the first public instance of competing AI providers conducting coordinated safety research and sharing results. Both organisations committed to repeating the evaluation as models scale in capability.

The evaluation did not test for hallucination rates, factual accuracy drift, or output quality degradation, all areas in which users have reported changes in provider models. The scope was limited to alignment risks associated with advanced autonomous behaviour.

Neither provider disclosed whether the evaluation framework will be made public or whether independent researchers can replicate the tests. The joint findings were published as a blog post rather than a peer-reviewed paper, and no raw evaluation data was released alongside the summary.

Why this is an AI incident

Launch-archive bulk classification (10 May 2026). The source signal originates from a real AI provider, regulator, or model-comparison probe; the harm or behavioural change described would not have occurred without the AI system being deployed in the role described. An editor reviewing the archive may amend the rationale per wire.

Counterfactual "but-for" test per the Editor's Guide.

Codes M1, F10

Providers OpenAI