[openai-blog] Learning complex goals with iterated amplification
OpenAI published research on 22 October 2018 describing "iterated amplification," a training method intended to align AI systems with complex human goals [source]. The technique involves breaking down difficult tasks into simpler subtasks that humans can evaluate, then using those evaluations to train increasingly capable models.
The research acknowledged fundamental limitations in current alignment approaches. OpenAI noted that existing methods rely on humans providing feedback on model outputs, but humans cannot reliably evaluate solutions to problems they cannot solve themselves. This creates what the researchers called a "scalability bottleneck": as models become more capable, human oversight becomes less effective.
The iterated amplification proposal attempted to address this by having humans supervise weaker models on decomposed subtasks, then using those trained models to assist humans in evaluating stronger models. However, the research identified several failure modes. The method assumes tasks can be meaningfully decomposed, that human judgments remain consistent across decomposition levels, and that models learn the intended goal rather than optimising for human approval signals.
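The amplify-then-train loop described above can be illustrated with a toy sketch. This is not OpenAI's implementation. It assumes a deliberately trivial task (summing a tuple of integers), a "human" who can only answer questions of length one or less, and a lookup table standing in for a trained model. All names (`human_answer`, `decompose`, `amplify`, `distill`, `iterate`, `contiguous_slices`) are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Optional, Tuple

Question = Tuple[int, ...]  # toy task (assumption): sum the integers in the tuple
Model = Dict[Question, int]  # a lookup table stands in for a trained model


def human_answer(q: Question) -> Optional[int]:
    """The 'human' can directly answer only trivially small questions."""
    if len(q) == 0:
        return 0
    if len(q) == 1:
        return q[0]
    return None  # too hard to answer unaided


def decompose(q: Question) -> Tuple[Question, Question]:
    """Split a question into two simpler subquestions."""
    mid = len(q) // 2
    return q[:mid], q[mid:]


def amplify(q: Question, model: Model) -> Optional[int]:
    """Human plus model assistants: decompose, delegate, then combine."""
    direct = human_answer(q)
    if direct is not None:
        return direct
    total = 0
    for part in decompose(q):
        ans = human_answer(part)
        if ans is None:
            ans = model.get(part)  # consult the current (weaker) model
        if ans is None:
            return None  # the amplified system cannot answer this yet
        total += ans
    return total


def distill(questions: List[Question], model: Model) -> Model:
    """'Train' the next model to imitate the amplified system (here: memorise)."""
    new_model = dict(model)
    for q in questions:
        ans = amplify(q, model)  # targets come from the previous model only
        if ans is not None:
            new_model[q] = ans
    return new_model


def iterate(questions: List[Question], rounds: int) -> Tuple[Model, List[int]]:
    """Alternate amplification and distillation, recording model size per round."""
    model: Model = {}
    sizes: List[int] = []
    for _ in range(rounds):
        model = distill(questions, model)
        sizes.append(len(model))
    return model, sizes


def contiguous_slices(data: Question) -> List[Question]:
    """Training questions: every non-empty contiguous slice of the data."""
    n = len(data)
    return [data[i:j] for i in range(n) for j in range(i + 1, n + 1)]


if __name__ == "__main__":
    task = tuple(range(1, 9))
    model, sizes = iterate(contiguous_slices(task), rounds=3)
    print(sizes, model.get(task))
```

In this sketch the model covers only questions of length two or less after the first round, and it takes three rounds before the full eight-element question can be answered. The example also makes the decomposability assumption visible: if `decompose` could not split a question into parts the human or the current model can handle, `amplify` would return `None` and no training target would exist for that question.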
OpenAI's paper noted these assumptions may not hold for all domains. Tasks requiring holistic reasoning or long-term consequences might resist decomposition. The research also flagged risks of "reward hacking," where models learn to produce outputs that satisfy the training signal without achieving the underlying objective.
The publication represented an early acknowledgment by a major AI provider that alignment techniques face inherent scaling challenges. The research did not claim to solve these problems, instead proposing iterated amplification as one experimental direction requiring further validation. OpenAI noted the approach remained theoretical and untested at scale.
Why this is an AI incident
Launch-archive bulk classification (10 May 2026). The source signal originates from a real AI provider, regulator, or model-comparison probe; the harm or behavioural change described would not have occurred without the AI system being deployed in the role described. An editor reviewing the archive may amend the rationale for each wire.
Counterfactual "but-for" test per the Editor's Guide.