[openai-blog] How confessions can keep language models honest

SEV-3 OpenAI

2026-05-10 2 sources standard

OpenAI published research on 3 December 2025 describing a technique called "confessions" to reduce hallucinations in language models [source]. The method instructs models to explicitly state when they lack knowledge or are uncertain, rather than generating plausible-sounding but incorrect information.

The research tested confessions across multiple question-answering tasks. Models using the technique showed measurable reductions in fabricated responses compared to baseline prompting. When a model confessed uncertainty, follow-up queries or retrieval steps could be triggered, improving overall accuracy.
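The confess-then-retrieve flow described above can be sketched roughly as follows. This is a minimal illustration, not OpenAI's implementation: the `ModelReply` type, its `confessed` flag, and the `ask_model` and `retrieve` callables are all hypothetical names introduced here, and the blog post does not specify how a confession is detected or acted on.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModelReply:
    """Hypothetical container for one model response."""
    text: str
    confessed: bool  # assumption: True when the model states it is unsure


def answer_with_fallback(
    question: str,
    ask_model: Callable[[str, str], ModelReply],  # hypothetical: (question, context) -> reply
    retrieve: Callable[[str], str],               # hypothetical: question -> supporting text
) -> str:
    """Answer directly if the model is confident; otherwise retry once with retrieved context."""
    first = ask_model(question, "")
    if not first.confessed:
        return first.text

    # The model confessed uncertainty, so trigger a retrieval step and ask again.
    context = retrieve(question)
    second = ask_model(question, context)
    if not second.confessed:
        return second.text

    # Still uncertain after retrieval: surface the gap rather than guess.
    return "I don't know."
```

The design choice worth noting is that a confession is treated as a routing signal rather than a final answer: only when the retry also confesses does the caller see an explicit "I don't know."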

OpenAI's findings indicate that standard reinforcement learning from human feedback (RLHF) can inadvertently train models to sound confident even when wrong. The confession approach attempts to counteract this by rewarding explicit acknowledgment of knowledge gaps. The technique does not eliminate hallucinations entirely but provides a mechanism for models to signal low-confidence outputs.

The research comes amid ongoing scrutiny of language model reliability in high-stakes applications. Multiple providers have faced criticism for models generating false citations, incorrect medical information, and fabricated legal precedents. OpenAI's disclosure does not specify whether confessions will be implemented in production models or remain experimental.

The blog post notes that confessions can reduce task completion rates when models over-confess or refuse to answer questions they could handle. Calibrating the threshold for when a model should confess versus attempt an answer remains an open challenge. No timeline was provided for potential deployment in GPT-4, GPT-3.5, or other OpenAI products.
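One way to picture the calibration trade-off is a threshold sweep over labelled examples. The sketch below assumes each example carries a numeric confidence score and a correctness label, and that the model confesses whenever its confidence falls below the threshold. That scoring scheme and the function name `sweep_thresholds` are assumptions made for illustration; the blog post does not describe how confessions are scored.

```python
from typing import List, Tuple


def sweep_thresholds(
    samples: List[Tuple[float, bool]],  # assumption: (confidence, answer_was_correct)
    thresholds: List[float],
) -> List[Tuple[float, float, float]]:
    """For each threshold, return (threshold, completion_rate, error_rate_among_answered)."""
    rows = []
    for t in thresholds:
        # The model attempts an answer only when confidence >= t; otherwise it confesses.
        answered = [ok for conf, ok in samples if conf >= t]
        completion = len(answered) / len(samples) if samples else 0.0
        wrong = sum(1 for ok in answered if not ok)
        error_rate = wrong / len(answered) if answered else 0.0
        rows.append((t, completion, error_rate))
    return rows
```

Under these assumptions, raising the threshold tends to lower the error rate among attempted answers while also lowering the completion rate, which is the over-confession cost the post describes.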

The research was conducted by OpenAI's alignment team. Full technical details were not included in the blog post, and no accompanying paper was linked at the time of publication.

Why this is an AI incident

Launch-archive bulk classification (10 May 2026). The source signal originates from a real AI provider, regulator, or model-comparison probe; the harm or behavioural change described would not have occurred without the AI system being deployed in the role described. An editor reviewing the archive may amend the rationale per-wire.

Counterfactual "but-for" test per the Editor's Guide.

Codes M1, F10

Providers OpenAI