[openai-blog] Addendum to o3 and o4-mini system card: Codex
OpenAI has published an addendum to its o3 and o4-mini system card addressing the Codex evaluation framework, a benchmark for assessing code generation capabilities [source]. The addendum clarifies how these models perform on programming tasks and updates safety considerations for code-related outputs.
The document reports that o3 scored 71.7% on the HumanEval benchmark and o4-mini scored 68.9%. Both models were evaluated on their ability to generate functionally correct Python code from natural language descriptions. OpenAI notes that performance varies significantly across programming languages, with Python and JavaScript showing higher success rates than languages such as Rust or Go.
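For readers unfamiliar with the term, "functionally correct" means that generated code is judged by whether it passes tests, not by how closely it resembles a reference answer. The following is a minimal illustrative sketch of that idea, not OpenAI's evaluation harness. The names `passes_tests`, `pass_rate`, `check_add`, `GOOD`, and `BAD` are hypothetical and introduced here only for illustration.

```python
from typing import Callable, Dict, List


def passes_tests(candidate_src: str, entry_point: str,
                 tests: List[Callable]) -> bool:
    """Return True if the candidate defines entry_point and passes every test.

    Assumption: candidates are trusted toy strings. Real harnesses run
    untrusted generated code in a sandbox with timeouts, never a bare exec().
    """
    namespace: Dict[str, object] = {}
    try:
        exec(candidate_src, namespace)      # define the candidate function
        func = namespace[entry_point]       # KeyError if it is missing
        for check in tests:
            check(func)                     # each check raises on failure
    except Exception:
        return False
    return True


def pass_rate(results: List[bool]) -> float:
    """Fraction of candidates that passed; 0.0 for an empty list."""
    return sum(results) / len(results) if results else 0.0


def check_add(f) -> None:
    # Hypothetical unit test for a task described as "add two numbers".
    assert f(2, 3) == 5
    assert f(-1, 1) == 0


GOOD = "def add(a, b):\n    return a + b\n"
BAD = "def add(a, b):\n    return a - b\n"   # valid syntax, wrong logic

results = [
    passes_tests(GOOD, "add", [check_add]),
    passes_tests(BAD, "add", [check_add]),
]
rate = pass_rate(results)
```

Here the second candidate is syntactically valid but fails its test, so only one of the two candidates counts toward the score.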
The addendum also documents observed failure modes. Models occasionally generate syntactically valid code that fails edge cases or produces incorrect logic when handling complex data structures. In some instances, o3 generated code with subtle security vulnerabilities, including improper input validation and potential injection vectors. OpenAI states these issues were identified during internal red-teaming exercises.
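The addendum's specific examples are not reproduced in this summary. As a generic illustration of what an injection vector caused by improper input validation can look like, the sketch below contrasts a query built by string interpolation with a parameterized one, using Python's standard `sqlite3` module. The table, the helper names (`make_db`, `lookup_unsafe`, `lookup_safe`), and the payload are all hypothetical.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with two illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a1"), ("bob", "b2")],
    )
    return conn


def lookup_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: user input is spliced directly into the SQL text.
    query = f"SELECT secret FROM users WHERE name = '{name}'"
    return [row[0] for row in conn.execute(query)]


def lookup_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized: the driver passes the input as data, not as SQL.
    query = "SELECT secret FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


conn = make_db()
payload = "x' OR '1'='1"                 # hypothetical attacker input

leaked = lookup_unsafe(conn, payload)    # condition becomes always true
safe = lookup_safe(conn, payload)        # no user has this literal name
```

With the interpolated query, the payload rewrites the `WHERE` clause and every row's secret is returned. The parameterized version treats the same input as an ordinary string and matches nothing.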
OpenAI has updated its usage guidelines to recommend additional code review for security-sensitive applications. The company advises developers to apply the same scrutiny to model-generated code as to human-written code, particularly for authentication, data handling, and external API interactions.
The addendum follows earlier system cards released in March 2025 for o3 and o4-mini. OpenAI indicates it will continue publishing updates as new evaluation results become available. The company has not specified whether mitigation measures for the identified failure modes will be implemented in future model versions.
Why this is an AI incident
Launch-archive bulk classification (10 May 2026). The source signal originates from a real AI provider, regulator, or model-comparison probe; the harm or behavioural change described would not have occurred without the AI system being deployed in the role described. An editor reviewing the archive may amend the rationale for each wire.
This rationale applies the counterfactual "but-for" test set out in the Editor's Guide.