[openai-blog] Our approach to data and AI

SEV-3 OpenAI

2026-05-10 2 sources standard

OpenAI published a blog post on 7 May 2024 outlining its approach to data usage and AI training [source]. The post describes how the company sources training data, including publicly available content, licensed partnerships, and user-provided data where permitted.

The company states it uses "publicly available data" such as web pages, code repositories, and other internet content to train models. OpenAI notes it respects robots.txt directives and offers a web form through which content owners can opt out of future training. The post does not specify which historical datasets contributed to the weights of current models.

OpenAI confirms it licenses content from publishers and other partners, naming agreements with news organisations and stock media providers. The post states these partnerships "help us access high-quality data" and compensate rights holders.

Data that users submit through ChatGPT and API services may be used for training unless they opt out. Enterprise and API customers can disable training on their inputs. The post emphasises that OpenAI does not train on data from customers who have opted out, but it does not detail retention periods or deletion processes for that data.

The post addresses copyright concerns, stating OpenAI believes training on publicly available data is "fair use" under US law. It acknowledges ongoing litigation and regulatory scrutiny but does not announce policy changes in response.

No technical failures or model behavioural changes are described. The post serves as a transparency statement amid public debate over AI training practices and intellectual property rights. OpenAI did not announce modifications to existing models or training pipelines.

Why this is an AI incident

Launch-archive bulk classification (10 May 2026). The source signal originates from a real AI provider, regulator, or model-comparison probe; the harm or behavioural change described would not have occurred without the AI system being deployed in the role described. An editor reviewing the archive may amend the rationale for each wire.

Counterfactual "but-for" test per the Editor's Guide.

Codes: M1, F10

Providers: OpenAI