[openai-blog] Introducing AgentKit, new Evals, and RFT for agents
OpenAI announced AgentKit on 6 January 2025, a framework designed to simplify the development of AI agents using its models [source]. The release includes new evaluation tools and a technique called Reinforcement Fine-Tuning (RFT) intended to improve agent performance on multi-step tasks.
AgentKit provides pre-built components for common agent workflows, including tool calling, memory management, and task decomposition. OpenAI states the framework reduces boilerplate code and integrates with existing OpenAI API endpoints. The company also introduced a suite of agent-specific evaluations to measure task completion rates, reasoning accuracy, and tool use correctness across benchmark scenarios.
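As an illustration of what two of the evaluation measures named above could mean in practice, the sketch below computes a task completion rate and a tool-use correctness score over recorded agent runs. It is a hypothetical example and not OpenAI's evaluation code: the announcement does not describe how the metrics are computed, and the names `RunRecord`, `ToolCall`, `task_completion_rate` and `tool_use_correctness` are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolCall:
    name: str      # tool the agent actually called (hypothetical field)
    expected: str  # tool a reference solution would call at this step (hypothetical field)


@dataclass
class RunRecord:
    completed: bool             # whether the run was judged to have finished the task
    tool_calls: List[ToolCall]  # tool calls made during the run


def task_completion_rate(runs: List[RunRecord]) -> float:
    """Fraction of runs judged complete; 0.0 for an empty list."""
    if not runs:
        return 0.0
    return sum(r.completed for r in runs) / len(runs)


def tool_use_correctness(runs: List[RunRecord]) -> float:
    """Fraction of all tool calls that match the reference tool; 0.0 if there are none."""
    calls = [c for r in runs for c in r.tool_calls]
    if not calls:
        return 0.0
    return sum(c.name == c.expected for c in calls) / len(calls)


# Two illustrative runs: one completed with correct tool use, one failed with a wrong tool.
example_runs = [
    RunRecord(True, [ToolCall("search", "search"), ToolCall("calc", "calc")]),
    RunRecord(False, [ToolCall("search", "calc")]),
]
completion = task_completion_rate(example_runs)
correctness = tool_use_correctness(example_runs)
```

Exact-match scoring of tool names is the simplest possible choice; a real evaluation would likely also check tool arguments and call ordering.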
The RFT method applies reinforcement learning to fine-tune models on agent trajectories, rewarding successful task completions and penalising failures. OpenAI claims this approach improves on base models in internal benchmarks, but the announcement did not disclose specific performance metrics.
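The reward scheme described above can be sketched generically. This is not OpenAI's RFT implementation, whose reward function the announcement does not disclose. It assumes a single terminal reward per trajectory (+1 for success, -1 for failure) and subtracts a batch-mean baseline, a standard variance-reduction step in policy-gradient methods. The names `Trajectory`, `trajectory_reward` and `advantages`, and the reward values themselves, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    actions: List[str]  # hypothetical: sequence of actions or tool calls the agent took
    succeeded: bool     # whether the task was judged complete


def trajectory_reward(traj: Trajectory,
                      success_reward: float = 1.0,
                      failure_penalty: float = -1.0) -> float:
    """Terminal reward: positive for a completed task, negative for a failure."""
    return success_reward if traj.succeeded else failure_penalty


def advantages(trajs: List[Trajectory]) -> List[float]:
    """Each trajectory's reward minus the batch-mean reward (the baseline)."""
    if not trajs:
        return []
    rewards = [trajectory_reward(t) for t in trajs]
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Illustrative batch: three successes and one failure.
batch = [
    Trajectory(["search", "answer"], True),
    Trajectory(["search", "search"], False),
    Trajectory(["calc", "answer"], True),
    Trajectory(["answer"], True),
]
batch_advantages = advantages(batch)
```

In a policy-gradient update, each trajectory's advantage would weight the log-probability of its actions, so the failed trajectory is pushed down more strongly than the successful ones are pushed up when failures are rare in the batch.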
The release follows growing industry focus on agentic AI systems capable of executing complex, multi-step workflows with minimal human intervention. OpenAI positions AgentKit as a response to developer demand for standardised tooling in this domain.
No independent verification of the evaluation results has been published. The announcement does not specify which OpenAI models support RFT or whether the technique will be available to all API customers. Developers using earlier agent frameworks may face migration costs if they adopt AgentKit's architecture.
The timing coincides with similar agent-focused releases from other providers in late 2024 and early 2025, reflecting broader competitive pressure to demonstrate progress in autonomous task execution capabilities.
Why this is an AI incident
Launch-archive bulk classification (10 May 2026). The source signal originates from a real AI provider, regulator, or model-comparison probe; the harm or behavioural change described would not have occurred without the AI system being deployed in the role described. The editor reviewing the archive may amend the rationale for each wire.
Counterfactual "but-for" test per the Editor's Guide.