[openai-blog] Learning from human preferences

SEV-3 OpenAI

2026-05-10 2 sources standard

OpenAI published research on 13 June 2017 describing a technique for training AI systems using human feedback rather than pre-programmed reward functions [source]. The approach, developed with DeepMind researchers, allows models to learn complex behaviours by observing human preferences between pairs of trajectory segments.

The team demonstrated the method across multiple domains. In Atari games, agents learned to play from approximately 900 bits of feedback, equivalent to 15 minutes of human evaluation. In MuJoCo robotics simulations, models acquired backflip behaviours from 10 minutes of feedback. The technique also enabled an agent to perform novel tasks in the Enduro racing game without access to the game's score.

The research addresses a fundamental challenge in AI alignment: specifying objectives for complex real-world tasks where traditional reward engineering proves inadequate. By learning reward functions from comparative human judgements, the system reduces reliance on manually designed metrics that may incentivise unintended behaviours.
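
How a reward function can be fitted to pairwise human judgements is illustrated in the minimal sketch below. It assumes a Bradley-Terry style preference model, a linear reward over hypothetical per-step feature vectors, and a simulated evaluator who always prefers the segment with the higher hidden return. All function names, features, and parameters are illustrative assumptions and are not taken from the source; the researchers' own system is not reproduced here.

```python
import math
import random


def segment_return(weights, segment):
    """Sum a linear per-step reward estimate over one trajectory segment."""
    return sum(sum(w * x for w, x in zip(weights, step)) for step in segment)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_probability(weights, seg_a, seg_b):
    """Modelled probability that seg_a is preferred over seg_b (Bradley-Terry form)."""
    return sigmoid(segment_return(weights, seg_a) - segment_return(weights, seg_b))


def train_reward_model(comparisons, n_features, lr=0.1, epochs=50):
    """Fit reward weights by gradient descent on the cross-entropy of the comparisons."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for seg_a, seg_b, label in comparisons:
            # label is 1.0 if the evaluator preferred seg_a, 0.0 if seg_b.
            p = preference_probability(weights, seg_a, seg_b)
            for i in range(n_features):
                # Difference in summed feature i between the two segments.
                feat_diff = sum(s[i] for s in seg_a) - sum(s[i] for s in seg_b)
                weights[i] -= lr * (p - label) * feat_diff
    return weights


def make_synthetic_comparisons(true_weights, n_pairs, seg_len=5, seed=0):
    """Simulated evaluator (an assumption): prefers the higher hidden return."""
    rng = random.Random(seed)

    def random_segment():
        return [[rng.uniform(-1, 1) for _ in true_weights] for _ in range(seg_len)]

    data = []
    for _ in range(n_pairs):
        a, b = random_segment(), random_segment()
        better_a = segment_return(true_weights, a) > segment_return(true_weights, b)
        data.append((a, b, 1.0 if better_a else 0.0))
    return data


if __name__ == "__main__":
    comparisons = make_synthetic_comparisons([1.0, -1.0], n_pairs=200)
    learned = train_reward_model(comparisons, n_features=2)
    print("learned reward weights:", learned)
```

In this sketch the learned weights are never shown the hidden reward directly; they are recovered only from which of two segments was preferred. A reinforcement learning agent would then be trained against the learned reward rather than a hand-written one.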

OpenAI noted the approach remains sample-inefficient compared to learning from demonstrations, requiring more human time for equivalent performance. The team identified several limitations: the method assumes human evaluators can recognise correct behaviour even when unable to demonstrate it, feedback quality depends on evaluator attention and consistency, and the technique may not scale to tasks where humans cannot effectively evaluate outcomes.

The research formed part of OpenAI's early work on reinforcement learning from human feedback, a technique that would later underpin instruction-following capabilities in language models. The paper was co-authored with researchers from DeepMind and the University of California, Berkeley.

Why this is an AI incident

Launch-archive bulk classification (10 May 2026). The source signal originates from a real AI provider, regulator, or model-comparison probe; the harm or behavioural change described would not have occurred without the AI system being deployed in the role described. An editor reviewing the archive may amend the rationale per wire.

Counterfactual "but-for" test per the Editor's Guide.

Codes M1, F10

Providers OpenAI