[openai-blog] Solving math word problems

SEV-3 OpenAI

2026-05-10 2 sources standard

OpenAI published research on 29 October 2021 describing a model trained to solve grade-school math word problems, achieving what the company characterised as "state-of-the-art" performance on the GSM8K benchmark [source].

The system was reported to solve about 55% of problems on a test drawn from the GSM8K dataset. It combined a GPT-3 model fine-tuned on math problems with natural-language solutions and a separately trained verifier that ranked candidate solutions. OpenAI said this was nearly twice the accuracy of a fine-tuned GPT-3 baseline.
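For readers who want to see what a benchmark solve rate means in practice, the following is a minimal sketch and not OpenAI's evaluation code. It assumes GSM8K-style solutions, which end with a line of the form "#### <answer>", and counts a problem as solved when the final numbers match. The function names `extract_final_answer` and `solve_rate` and the toy data are hypothetical.

```python
import re

# Assumption: solutions end with "#### <number>", as in GSM8K reference answers.
FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(solution_text):
    """Return the final numeric answer as a float, or None if no marker is found."""
    match = FINAL_ANSWER.search(solution_text)
    if match is None:
        return None
    # Strip thousands separators such as "1,200" before converting.
    return float(match.group(1).replace(",", ""))


def solve_rate(predictions, references):
    """Fraction of problems whose predicted final answer equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for predicted, reference in zip(predictions, references):
        p = extract_final_answer(predicted)
        r = extract_final_answer(reference)
        if p is not None and p == r:
            correct += 1
    return correct / len(references)


# Toy data for illustration only (not taken from the benchmark).
toy_predictions = [
    "48 / 2 = 24, and 48 + 24 = 72\n#### 72",
    "3 * 4 = 10\n#### 10",
    "I am not sure.",
]
toy_references = ["#### 72", "#### 12", "#### 5"]
print(solve_rate(toy_predictions, toy_references))
```

Only the final number is compared here, so a solution with flawed intermediate reasoning but a correct final answer would still count as solved under this sketch.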

The research highlighted a training methodology in which human labellers wrote out solutions step by step, showing intermediate reasoning. The model was then fine-tuned on these examples. OpenAI reported that larger models benefited more from this approach than smaller ones, and that the technique generalised poorly to problems requiring mathematical reasoning different from that in the training set.

The company acknowledged limitations in the work. Models frequently made arithmetic errors, struggled with problems requiring multiple steps of logic, and sometimes produced solutions that appeared superficially correct but contained fundamental errors in reasoning. OpenAI stated that performance degraded significantly on problems outside the training distribution.
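The wire does not say how arithmetic errors were identified. As a rough illustration, assuming solutions carry GSM8K-style calculator annotations of the form `<<48/2=24>>`, each annotated step can be recomputed and compared with the claimed result. The helper name `find_arithmetic_errors` and the sample text are hypothetical, and the sketch handles only the four basic operators.

```python
import ast
import operator
import re

# Assumption: intermediate steps are annotated as <<expression=result>>.
ANNOTATION = re.compile(r"<<([^=<>]+)=([^<>]+)>>")

_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node):
    """Evaluate a parsed arithmetic expression without calling eval()."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
        return _OPERATORS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression")


def find_arithmetic_errors(solution_text, tolerance=1e-6):
    """Return (expression, claimed_result) pairs whose arithmetic does not check out."""
    errors = []
    for expression, claimed in ANNOTATION.findall(solution_text):
        try:
            actual = _evaluate(ast.parse(expression.strip(), mode="eval"))
            matches = abs(actual - float(claimed)) <= tolerance
        except (ValueError, SyntaxError, ZeroDivisionError):
            # Unparseable or unsupported steps are reported rather than skipped.
            matches = False
        if not matches:
            errors.append((expression.strip(), claimed.strip()))
    return errors


# Illustrative text: the second step contains a deliberate slip.
sample = "She sold 48/2 = <<48/2=24>>24 clips in May, so 48+24 = <<48+24=73>>73 in total."
print(find_arithmetic_errors(sample))
```

A check like this catches only miscalculated steps. It cannot detect the other failure the wire describes, where each calculation is correct but the reasoning that chose those calculations is wrong.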

The research did not describe deployment plans or integration into commercial products. OpenAI framed the work as exploratory, noting that "solving math word problems remains a challenging testbed for language models" and that further research would be needed to improve reliability and generalisation.

The announcement provided no information about model availability, API access, or reproducibility of results. No independent verification of the reported benchmark performance was referenced.

Why this is an AI incident

Launch-archive bulk classification (10 May 2026). Source signal originates from a real AI provider, regulator, or model-comparison probe; the harm or behavioural change described would not have occurred without the AI system being deployed in the role described. Editor reviewing the archive may amend the rationale per-wire.

Counterfactual "but-for" test per the Editor's Guide.

Codes: M1, F10

Providers: OpenAI