[openai-blog] Prompt Caching in the API

SEV-3OpenAI

[openai-blog] Prompt Caching in the API

2026-05-10 2 sources standard

OpenAI announced prompt caching for its API on 1 October 2024, a feature that reduces costs and latency by storing frequently reused prompt segments [source]. The system automatically caches prompt prefixes longer than 1,024 tokens, charging 50% less for cached input tokens and delivering faster response times.

The feature applies to GPT-4o, GPT-4o mini, o1-preview, and o1-mini models. Cached content remains available for 5 to 10 minutes depending on the model. OpenAI states that developers using long contexts—such as detailed instructions, large documents, or conversation histories—will see the most benefit.

Prompt caching operates without requiring code changes. The API detects repeated prompt prefixes and applies discounts automatically. Cached tokens appear in usage metadata as `cached_tokens`, allowing developers to monitor cache performance. OpenAI notes that cache hits depend on exact prefix matching; any modification to cached content invalidates the cache.

The announcement follows similar caching implementations by Anthropic and Google in mid-2024. Anthropic introduced prompt caching for Claude models in August, while Google added context caching to Gemini 1.5 Pro and Flash in May. All three providers cite cost reduction and latency improvement as primary benefits.

OpenAI's documentation warns that caching behaviour may change as the feature evolves. Developers relying on consistent token counts or latency patterns should monitor usage metrics. The company has not disclosed whether cached content is isolated between API customers or how cache eviction prioritises high-traffic users.

The feature is now available in the API with no opt-in required. Pricing details and model-specific cache durations are published in OpenAI's API documentation.

Why this is an AI incident

Launch-archive bulk classification (10 May 2026). Source signal originates from a real AI provider, regulator, or model-comparison probe; the harm or behavioural change described would not have occurred without the AI system being deployed in the role described. Editor reviewing the archive may amend the rationale per-wire.

Counterfactual "but-for" test per the Editor's Guide.

Codes M1, F10

Providers OpenAI