[openai-blog] Model Distillation in the API
OpenAI announced on 1 October 2024 that it is making model distillation available through its API, allowing developers to create smaller, faster models trained on outputs from larger models like GPT-4o [source].
The feature enables developers to generate synthetic training data by prompting a larger model, then use that data to fine-tune a smaller model such as GPT-4o mini or GPT-3.5 Turbo. OpenAI states this can reduce costs and latency while maintaining task-specific performance.
The announcement describes a stored completions feature that saves API responses for later use in fine-tuning jobs. Developers can filter and export these completions, then initiate fine-tuning runs directly through the API or dashboard.
OpenAI provided case studies from early access partners. Cosine reported reducing inference costs by 78% and latency by 83% when distilling from o1-preview to GPT-4o mini for code generation tasks. Descript reported a 50% cost reduction distilling from GPT-4o to GPT-4o mini for audio transcription correction.
The distillation workflow requires developers to enable stored completions on API requests, accumulate sufficient examples, then create a fine-tuning job referencing the stored data. OpenAI notes that developers retain control over their data and can delete stored completions at any time.
The feature is available to developers on paid usage tiers. OpenAI states that stored completions do not incur additional storage fees during a limited introductory period.
This marks OpenAI's first productised offering explicitly designed for model distillation, a technique previously available only through manual data collection and fine-tuning workflows. The announcement did not disclose minimum dataset sizes or performance benchmarks for distilled models.
Why this is an AI incident
Launch-archive bulk classification (10 May 2026). Source signal originates from a real AI provider, regulator, or model-comparison probe; the harm or behavioural change described would not have occurred without the AI system being deployed in the role described. Editor reviewing the archive may amend the rationale per-wire.
Counterfactual "but-for" test per the Editor's Guide.