[openai-blog] Language models can explain neurons in language models

SEV-3 OpenAI

2026-05-10 2 sources standard

OpenAI published research on 9 May 2023 demonstrating that GPT-4 can generate natural language explanations for the behaviour of neurons in GPT-2, a smaller language model [source]. The technique involves showing GPT-4 text sequences that strongly activate a particular neuron, then asking it to explain what pattern the neuron detects. Each explanation is then scored: GPT-4 uses the explanation alone to simulate how the neuron would activate on a piece of text, and the simulated activations are compared with the neuron's actual activations.
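
The comparison step described above can be illustrated with a minimal sketch. It assumes the score is a Pearson correlation between simulated and actual per-token activations; the function name `correlation_score` and the sample values are hypothetical and are not taken from OpenAI's released code.

```python
import math


def correlation_score(simulated, actual):
    """Pearson correlation between simulated and actual activations.

    Assumption: the explanation score is a plain correlation coefficient.
    Returns 0.0 when either sequence is constant, since correlation is
    undefined in that case.
    """
    if len(simulated) != len(actual) or len(simulated) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    n = len(simulated)
    mean_s = sum(simulated) / n
    mean_a = sum(actual) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(simulated, actual))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_s == 0 or var_a == 0:
        return 0.0
    return cov / math.sqrt(var_s * var_a)


# Hypothetical per-token values for one neuron on one text excerpt.
actual = [0.0, 0.1, 3.2, 0.0, 2.9]   # real activations (illustrative)
simulated = [0, 0, 8, 0, 7]          # simulator's guesses on its own scale
score = correlation_score(simulated, actual)
print(f"explanation score: {score:.3f}")
```

Correlation is insensitive to the scale of the simulated values, so a simulator that predicts on a 0 to 10 scale can still be compared with raw activations. A score near 1 means the explanation predicts where the neuron fires; a score near 0 means it does not.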

The research team applied this method to all 307,200 neurons in GPT-2 and generated an explanation for each. They report that more than 1,000 neurons have explanations scoring at least 0.8, which, according to GPT-4, means the explanation accounts for most of the neuron's top-activating behaviour. The dataset of explanations has been released publicly.

OpenAI describes the work as a proof of concept for using language models to automate interpretability research. The approach does not require human labelling of neuron behaviour, which has historically been a bottleneck in understanding how large models process information. The researchers acknowledge that many explanations remain low-quality and that the technique does not yet scale to explaining all neurons in state-of-the-art models.

The publication notes that better explanations could emerge from improved language models or refined prompting strategies. OpenAI states the research is part of ongoing efforts to understand model internals, citing safety and alignment concerns as motivation. The technique represents an attempt to make neural network behaviour more transparent, though the researchers do not claim the explanations are complete or sufficient for full interpretability.

The dataset and code have been made available for external researchers to replicate and extend the findings.

Why this is an AI incident

Launch-archive bulk classification (10 May 2026). The source signal originates from a real AI provider, regulator, or model-comparison probe; the harm or behavioural change described would not have occurred without the AI system being deployed in the role described. An editor reviewing the archive may amend the rationale per wire.

Counterfactual "but-for" test per the Editor's Guide.

Codes M1, F10

Providers OpenAI