← Latest · Archive

SEV-3OpenAI
2 sources standard

OpenAI has published research describing a new training method called "instruction hierarchy" designed to help large language models distinguish between system-level instructions and user-provided inputs [source]. The work addresses a class of vulnerabilities where adversarial users embed malicious instructions in data the model processes, causing it to ignore its original directives.

The research demonstrates that models trained with instruction hierarchy can better resist prompt injection attacks. In testing, the approach reduced the success rate of attacks where users attempt to override system instructions by hiding commands in documents, emails, or other content the model reads.

OpenAI's method involves explicitly teaching models to prioritize instructions based on their source. System messages from developers receive highest priority, while user messages and content retrieved from external sources receive lower priority. The training uses synthetic examples of adversarial prompts to reinforce this hierarchy.

The company reports that models trained this way maintain normal performance on standard tasks while showing improved robustness against instruction-override attempts. However, the research acknowledges the approach does not eliminate all prompt injection risks.

This work follows years of documented prompt injection incidents across the industry, where models have been manipulated to leak system prompts, ignore safety guidelines, or perform unintended actions. The instruction hierarchy method represents an architectural response to these failures rather than relying solely on content filtering.

OpenAI states the technique has been incorporated into recent model training but does not specify which production models currently implement it. The research paper includes evaluation benchmarks and example attack scenarios used during development.

Why this is an AI incident

Launch-archive bulk classification (10 May 2026). Source signal originates from a real AI provider, regulator, or model-comparison probe; the harm or behavioural change described would not have occurred without the AI system being deployed in the role described. Editor reviewing the archive may amend the rationale per-wire.

Counterfactual "but-for" test per the Editor's Guide.

Codes M1, F10
Providers OpenAI