SEV-3 OpenAI
2 sources standard

OpenAI published research on 31 May 2023 describing an approach to training models for mathematical reasoning using "process supervision" rather than "outcome supervision" [source]. The company reported that process supervision, which rewards each correct step in a solution, produced better results than outcome supervision, which rewards only a correct final answer.
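The difference between the two reward signals can be illustrated with a minimal sketch. It is not OpenAI's implementation: the function names, the exact-match answer check, and the 0/1 reward values are assumptions chosen for illustration only.

```python
from typing import List


def outcome_reward(final_answer: str, expected_answer: str) -> float:
    """Outcome supervision: one reward for the whole solution,
    based only on whether the final answer matches (assumed exact match)."""
    return 1.0 if final_answer.strip() == expected_answer.strip() else 0.0


def process_rewards(step_is_correct: List[bool]) -> List[float]:
    """Process supervision: one reward per reasoning step."""
    return [1.0 if ok else 0.0 for ok in step_is_correct]


# Hypothetical solution: the second step is wrong, yet the final answer matches.
steps_ok = [True, False, True]
print(outcome_reward("42", "42"))  # a single signal for the whole solution
print(process_rewards(steps_ok))   # one signal per step
```

In this toy case the outcome signal cannot distinguish a sound solution from one that reaches the right answer through a faulty step, while the per-step signal marks where the reasoning went wrong.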

The research tested both methods on the MATH dataset, a benchmark of competition-level mathematics problems. A reward model trained with process supervision solved 78% of problems from a representative subset of the test set, compared with 72% for one trained with outcome supervision, a gap of six percentage points. OpenAI stated the improvement was consistent across difficulty levels.

Process supervision requires human labellers to evaluate each step of a model's reasoning chain, marking steps as positive, negative, or neutral. OpenAI released PRM800K, a dataset of 800,000 step-level labels across 75,000 solutions, to support further research [source].
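A step-labelled solution of the kind described above might be represented as follows. This is a sketch under stated assumptions: the `StepLabel` class, its field names, and the integer encoding of the three ratings are hypothetical and are not taken from the PRM800K release format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed integer encoding of the three ratings (hypothetical).
POSITIVE, NEUTRAL, NEGATIVE = 1, 0, -1


@dataclass
class StepLabel:
    text: str    # one step of the model's reasoning chain
    rating: int  # POSITIVE, NEUTRAL, or NEGATIVE, assigned by a human labeller


def first_negative_step(steps: List[StepLabel]) -> Optional[int]:
    """Return the index of the first step rated negative, or None if there is none."""
    for index, step in enumerate(steps):
        if step.rating == NEGATIVE:
            return index
    return None


def rating_counts(steps: List[StepLabel]) -> Dict[int, int]:
    """Count how many steps received each rating."""
    counts = {POSITIVE: 0, NEUTRAL: 0, NEGATIVE: 0}
    for step in steps:
        counts[step.rating] += 1
    return counts


# Hypothetical labelled solution, for illustration only.
solution = [
    StepLabel("Let x be the unknown side length.", POSITIVE),
    StepLabel("Note that the figure is symmetric.", NEUTRAL),
    StepLabel("Therefore x = 2 + 2 = 5.", NEGATIVE),
]
print(first_negative_step(solution))
print(rating_counts(solution))
```

Storing a rating per step, rather than one verdict per solution, is what lets a labelled solution show where the reasoning first fails.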

The company noted that process supervision may reduce "hallucinations", instances where models produce plausible-looking but incorrect reasoning. In the study, outcome-supervised models were more likely to generate solutions that appeared valid but contained subtle errors. Process-supervised models showed fewer such failures.

OpenAI acknowledged limitations. Process supervision is more labour-intensive than outcome supervision, requiring detailed human review of intermediate steps. The research focused on mathematical reasoning, and the company did not claim the approach generalises to all domains.

The work builds on earlier findings that large language models can perform multi-step reasoning when prompted to show their work. OpenAI suggested process supervision could improve reliability in applications where verifying final answers is difficult but evaluating reasoning steps is feasible.

Why this is an AI incident

Launch-archive bulk classification (10 May 2026). The source signal originates from a real AI provider, regulator, or model-comparison probe; the harm or behavioural change described would not have occurred without the AI system being deployed in the role described. An editor reviewing the archive may amend the rationale per wire.

Counterfactual "but-for" test per the Editor's Guide.

Codes M1, F10
Providers OpenAI