SEV-3 OpenAI

[openai-blog] WebGPT: Improving the factual accuracy of language models through web browsing

2026-05-10 2 sources standard

On 16 December 2021, OpenAI disclosed that its language models can produce factually inaccurate outputs and introduced WebGPT, a prototype system designed to improve factual accuracy through web browsing [source].

The company acknowledged that existing language models "often make up facts" when answering questions, a limitation it attributed to training data that can be out of date or lack coverage of a specific query. WebGPT was developed to address this by enabling the model to search the web and cite its sources, much as a human researcher would verify information.

The system is a version of GPT-3 fine-tuned to interact with a text-based web browser. It can issue search queries, follow links, and scroll through pages. Human feedback was used to train the model to select reliable sources and compose answers with inline citations. OpenAI reported that answers from its best model were preferred to those written by its human demonstrators 56% of the time, and to the highest-voted answers from Reddit's ELI5 forum 69% of the time.
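
The browsing loop described above can be sketched as a toy, in-memory environment. This is a minimal illustration under stated assumptions, not OpenAI's implementation: the `TextBrowser` class, its method names (`search`, `click`, `scroll`, `quote`), the `render_answer` helper, and the `PAGES` corpus are all hypothetical, and in the real system a fine-tuned model, not hand-written calls, would choose each action.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical stand-in for the web: page title -> lines of text.
PAGES = {
    "Tides": [
        "Tides are caused mainly by the Moon's gravity.",
        "The Sun also contributes to tides.",
        "Tides occur roughly twice a day.",
    ],
    "Moon": [
        "The Moon orbits Earth.",
        "It takes about 27 days to complete one orbit.",
    ],
}


@dataclass
class TextBrowser:
    """Toy text browser exposing search, click, scroll and quote actions."""
    window: int = 2                      # lines visible at once
    results: List[str] = field(default_factory=list)
    page: Optional[str] = None
    offset: int = 0
    quotes: List[Tuple[str, str]] = field(default_factory=list)

    def search(self, query: str) -> List[str]:
        # Return titles of pages containing any query term.
        terms = query.lower().split()
        self.results = [
            title for title, lines in PAGES.items()
            if any(t in (title + " " + " ".join(lines)).lower() for t in terms)
        ]
        return self.results

    def click(self, index: int) -> List[str]:
        # Open a search result and show the top of the page.
        self.page = self.results[index]
        self.offset = 0
        return self.view()

    def scroll(self, lines: int = 1) -> List[str]:
        # Move the visible window, clamped to the page bounds.
        if self.page is None:
            raise RuntimeError("no page is open")
        max_offset = max(0, len(PAGES[self.page]) - self.window)
        self.offset = min(max(self.offset + lines, 0), max_offset)
        return self.view()

    def view(self) -> List[str]:
        if self.page is None:
            raise RuntimeError("no page is open")
        return PAGES[self.page][self.offset:self.offset + self.window]

    def quote(self, text: str) -> int:
        # Only text currently on screen can be cited; returns the citation number.
        if text not in self.view():
            raise ValueError("quoted text is not visible on the current page")
        self.quotes.append((self.page, text))
        return len(self.quotes)


def render_answer(text: str, browser: TextBrowser) -> str:
    # Append a numbered reference list built from the collected quotes.
    refs = [f'[{i}] {page}: "{quote}"'
            for i, (page, quote) in enumerate(browser.quotes, start=1)]
    return "\n".join([text, *refs])


if __name__ == "__main__":
    browser = TextBrowser()
    browser.search("tides")
    browser.click(0)
    browser.scroll(1)
    browser.quote("Tides occur roughly twice a day.")
    print(render_answer("Tides happen about twice daily [1].", browser))
```

Restricting `quote` to text that is currently on screen is a design choice made for this sketch: it keeps every citation traceable to a passage the browser actually displayed. It does not, by itself, guarantee that the cited passage supports the claim it is attached to.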

Despite these improvements, OpenAI noted that WebGPT still produces incorrect answers and sometimes cites sources that do not support its claims. The company stated that the model can "cherry-pick sources" to support a desired answer rather than forming conclusions based on evidence. The system also occasionally fabricates or misrepresents source content.

The disclosure represents an acknowledgment of persistent factual accuracy issues in OpenAI's language models. WebGPT remained a research prototype, with OpenAI indicating further work was needed to ensure reliable citation practices and reduce hallucination rates. The company did not specify whether or when the technology would be integrated into production systems.

Why this is an AI incident

Launch-archive bulk classification (10 May 2026). The source signal originates from a real AI provider, regulator, or model-comparison probe; the harm or behavioural change described would not have occurred without the AI system being deployed in the role described. An editor reviewing the archive may amend the rationale per wire.

Counterfactual "but-for" test per the Editor's Guide.

Codes M1, F10

Providers OpenAI