[openai-blog] GPT-5.1-Codex-Max System Card
OpenAI has published a system card for GPT-5.1-Codex-Max, documenting safety evaluations and capability assessments conducted prior to the model's deployment [source]. The card details red-teaming exercises, adversarial testing protocols, and measured performance across coding benchmarks including HumanEval, MBPP, and SWE-bench variants.
According to the disclosure, the model achieved 94.2% on HumanEval and 89.7% on MBPP, both improvements over GPT-4 Turbo's coding performance. OpenAI reports evaluating code generation risks, including insecure code patterns, malicious payload generation, and the capability to assist in developing dual-use software tools.
The system card describes mitigations intended to reduce the generation of exploitable code, including reinforcement learning from human feedback focused on secure coding practices and output filtering for known vulnerability patterns. OpenAI states that the model was tested against prompts designed to elicit malware, ransomware scaffolds, and exploitation frameworks, with refusal rates measured for each threat category.
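As a rough illustration of what measuring refusal rates by threat category involves, the following Python sketch aggregates per-prompt outcomes into a rate for each category. The function name, the category labels, and the sample data are hypothetical and are not taken from the system card; OpenAI's actual evaluation tooling is not described in this wire.

```python
from collections import defaultdict


def refusal_rates(results):
    """Return {category: fraction of prompts refused}.

    `results` is an iterable of (category, was_refused) pairs, where
    `was_refused` is True when the model declined the prompt.
    """
    totals = defaultdict(int)
    refused = defaultdict(int)
    for category, was_refused in results:
        totals[category] += 1
        if was_refused:
            refused[category] += 1
    return {category: refused[category] / totals[category] for category in totals}


if __name__ == "__main__":
    # Hypothetical outcomes for illustration only, not data from the system card.
    sample = [
        ("malware", True),
        ("malware", True),
        ("ransomware", True),
        ("ransomware", False),
    ]
    for category, rate in refusal_rates(sample).items():
        print(f"{category}: {rate:.0%}")
```

Reporting one rate per category, rather than a single pooled figure, keeps a weak category from being hidden by strong results elsewhere.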
The document notes residual risks, including the potential for the model to generate code containing subtle logic flaws, memory safety issues in low-level languages, and SQL injection vulnerabilities when context is ambiguous. OpenAI reports that these failure modes were observed during internal testing at rates between 2.1% and 7.3%, depending on prompt construction and programming language.
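For readers unfamiliar with the SQL injection failure mode named above, the following minimal Python sketch contrasts a query built by string interpolation with a parameterized one, using the standard-library sqlite3 module. The table, the function names, and the payload are hypothetical and chosen only for illustration; they are not examples drawn from the system card.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a small, hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn, name):
    # Vulnerable: the input is interpolated into the SQL text, so a crafted
    # value can change the structure of the query.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Parameterized: the driver binds the value separately from the SQL text,
    # so the input is only ever treated as data.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


if __name__ == "__main__":
    conn = make_demo_db()
    payload = "nobody' OR '1'='1"
    print(find_user_unsafe(conn, payload))  # the injected OR clause matches every row
    print(find_user_safe(conn, payload))    # no user has this literal name
```

The two functions look nearly identical, which is part of why this class of flaw can slip into generated code when a prompt does not say where the input comes from.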
The system card follows OpenAI's established practice of publishing pre-deployment evaluations for major model releases. It includes quantitative data on refusal rates, capability measurements, and descriptions of evaluation methodology. The disclosure does not report post-deployment incidents or behavioural changes observed in production environments.
Why this is an AI incident
Launch-archive bulk classification (10 May 2026). The source signal originates from a real AI provider, regulator, or model-comparison probe; the harm or behavioural change described would not have occurred without the AI system being deployed in the role described. An editor reviewing the archive may amend the rationale per wire.
Counterfactual "but-for" test per the Editor's Guide.