
SEV-3 · OpenAI
2 sources standard

OpenAI published technical details of GPT-4 on 14 March 2023, describing performance improvements over GPT-3.5 across multiple benchmarks but acknowledging persistent reliability issues [source].

The company reported that GPT-4 achieved scores in the 90th percentile on the Uniform Bar Exam and 99th percentile on the Biology Olympiad, compared to GPT-3.5's 10th and 31st percentiles respectively. On MMLU, a multitask knowledge benchmark, GPT-4 scored 86.4% versus GPT-3.5's 70.0%.

Despite these gains, OpenAI documented ongoing failure modes. The model is "not fully reliable," according to the announcement: it hallucinates facts and makes reasoning errors. OpenAI said GPT-4 has limitations similar to those of earlier GPT models.

OpenAI noted that GPT-4 was fine-tuned using reinforcement learning from human feedback and built on infrastructure designed to scale predictably, which let the company forecast aspects of the model's performance from much smaller training runs. The company stated it spent six months on safety mitigations after initial training completed in August 2022.

The announcement included internal evaluation results: OpenAI said GPT-4 was 82% less likely than GPT-3.5 to respond to requests for disallowed content, responded to sensitive requests in line with its policies 29% more often, and scored 40% higher on internal adversarial factuality evaluations. However, OpenAI provided no methodology for how "factual" was defined or measured in these tests.

The model accepts both text and image inputs but produces only text outputs. OpenAI did not disclose model size, architecture details, or training dataset composition, citing "the competitive landscape and the safety implications of large-scale models."

The research page remains available as a reference for the model's documented capabilities and acknowledged limitations at launch.

Why this is an AI incident

Launch-archive bulk classification (10 May 2026). The source signal originates from a real AI provider, regulator, or model-comparison probe; the harm or behavioural change described would not have occurred without the AI system being deployed in the role described. An editor reviewing the archive may amend the rationale for each wire.

Counterfactual "but-for" test per the Editor's Guide.

Codes M1, F10
Providers OpenAI