SEV-3 · OpenAI
2 sources standard

OpenAI published research on 19 September 2019 describing a method to fine-tune GPT-2 using human preferences rather than traditional supervised learning [source]. The work demonstrated that reinforcement learning from human feedback could steer language model outputs toward desired behaviours, including improved sentiment control and summarisation quality.

The research team trained reward models on comparisons made by human labellers, who ranked multiple model outputs for the same prompt. These reward models then guided policy optimisation, allowing GPT-2 to generate text that better aligned with human judgement. In sentiment tasks, the fine-tuned model produced more positive or negative text on demand. In summarisation experiments, outputs became more coherent and relevant according to evaluator assessments.
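The comparison-based training described above can be sketched in a few lines. This is a minimal illustration, not OpenAI's code: it assumes the reward model is fitted with a softmax cross-entropy loss over the candidates shown to a labeller, and that policy optimisation uses a reward reduced by a penalty for drifting from the original model. The function names `preference_loss` and `kl_penalised_reward` and the `beta` coefficient are hypothetical and do not come from the source.

```python
import math


def preference_loss(rewards, chosen):
    """Negative log-probability of the labeller's pick under a softmax over rewards.

    rewards: reward-model scores, one per candidate output for the same prompt.
    chosen:  index of the candidate the human labeller preferred.
    Assumed formulation for illustration only.
    """
    m = max(rewards)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(r - m) for r in rewards))
    return log_z - rewards[chosen]


def kl_penalised_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward used for policy optimisation in this sketch.

    Subtracts beta times the log-probability ratio between the fine-tuned policy
    and the original model, so the policy is discouraged from drifting far from
    the original model just to score well on the learned reward.
    """
    return reward - beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    scores = [1.2, -0.3, 0.4, 0.0]  # hypothetical reward-model scores
    print(preference_loss(scores, chosen=0))  # small loss: the pick has the top score
    print(preference_loss(scores, chosen=1))  # larger loss: the pick has the lowest score
    print(kl_penalised_reward(1.0, logp_policy=-1.0, logp_reference=-2.0, beta=0.5))
```

Minimising the first function over many labelled comparisons pushes the reward model to score preferred outputs higher; the second shows one common way to keep a policy from over-optimising an imperfect reward.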

OpenAI reported that this approach required significantly less labelled data than conventional supervised methods. The team collected approximately 5,000 human comparisons for the stylistic continuation tasks, which included sentiment, and approximately 60,000 for summarisation. The resulting models outperformed baseline GPT-2 on human evaluation metrics, though the research acknowledged limitations in scalability and the potential for reward hacking, in which a model exploits gaps in the reward signal rather than learning the intended behaviour.

The publication marked an early application of reinforcement learning from human feedback to generative language models. OpenAI noted that the technique could help address alignment challenges as models grew larger, though the research did not claim to solve broader safety or reliability concerns. The work preceded later RLHF implementations in InstructGPT and ChatGPT, which applied similar methods at greater scale.

No operational failures or unexpected model behaviours were reported in the research itself. The publication described controlled experiments rather than production deployment.

Why this is an AI incident

Launch-archive bulk classification (10 May 2026). Source signal originates from a real AI provider, regulator, or model-comparison probe; the harm or behavioural change described would not have occurred without the AI system being deployed in the role described. Editor reviewing the archive may amend the rationale per-wire.

Counterfactual "but-for" test per the Editor's Guide.

Codes M1, F10
Providers OpenAI