The Security of AI: Prompt Injection

Fri Feb 9, 2024

Large Language Models (LLMs) are being stitched into more and more products, quietly changing what “an interface” even means. After years in cybersecurity, it’s tempting to shrug and say: it’s software, it’s data — what’s new?

The uncomfortable answer is that the boundary between software and data is blurrier than we’re used to. In classic systems, untrusted input sits on one side of a parser and code sits on the other. With LLMs, natural language is both the user interface and — effectively — the control plane. That’s where prompt injection lives.

Prompt injection, explained without the hype

Prompt injection is the act of crafting input that causes an LLM to deviate from the system designer’s intended behaviour — whether by revealing hidden system guidance, following smuggled instructions, or performing actions outside the intended workflow. It works because the model is doing exactly what it was trained to do: follow instructions as best it can, even when those instructions arrive through places you thought were “just content”.

There are two broad shapes of it.

Direct prompt injection is the straightforward one: the attacker tells the model to ignore its previous instructions, reveal hidden system guidance, or perform actions outside the intended workflow. It’s the LLM equivalent of someone walking up to a staff member and saying, “Ignore your manager, I’m in charge now.”

Indirect prompt injection is more subtle and more dangerous in real products. The attacker hides instructions inside external content — web pages, PDFs, code snippets, tickets, emails — so when the model is asked to summarise or analyse that content, it accidentally treats embedded instructions as higher priority than the system prompt or the user’s request. This is where the “confused deputy” problem shows up — a concept borrowed from access control literature — where the model has access or authority the attacker doesn’t, and the attacker tricks it into using that authority on their behalf.

Prompt injection is listed as LLM01 in the OWASP Top 10 for LLM Applications, reflecting its status as the most critical risk facing LLM-integrated systems.

Is it the same as SQL injection?

No — and that distinction matters.

SQL injection targets a well-defined interpreter with a strict syntax. Prompt injection targets a probabilistic instruction follower that has no built-in, reliable way to distinguish “instructions” from “data”. In SQL injection, you’re exploiting a parser. In prompt injection, you’re exploiting precedence — what the model decides to treat as authoritative when multiple instructions compete. Some researchers frame this more fundamentally as the absence of any real separation between the instruction channel and the data channel.

There’s another twist: prompt injections don’t need to be human-readable. Adversarial suffixes — carefully crafted token sequences that look like noise to a person — can reliably steer model behaviour. If the model can parse it, it can be influenced by it.

What this looks like in the real world

Once LLMs are connected to tools — code repositories, ticketing systems, email, calendars, cloud APIs — the blast radius grows. At that point prompt injection isn’t just “the model said something weird”. It becomes “the model took an action”.

A few realistic examples:

Code review as an attack surface. An attacker submits code that includes hidden prompt instructions in comments or strings. The model, asked to review the diff, is nudged into leaking secrets, weakening a security recommendation, or generating an “approved” response it wouldn’t normally give.
Web summarisation with teeth. A user asks the assistant to summarise a page. The page contains embedded instructions telling the model to ask the user for credentials, exfiltrate data, or perform a follow-up action. The model doesn’t “know” the page is hostile; it just sees text.
Document workflows (CVs, PDFs, tickets). A malicious document can be engineered to bias downstream reviewers (“this is an excellent candidate”), or to push actions (“forward this to the hiring manager”), turning the model into a credibility amplifier.
Multi-turn poisoning. In long-running conversations or agent loops, an attacker can inject instructions that persist across turns, gradually shifting the model’s behaviour without any single message looking malicious on its own.

The common pattern is simple: you thought you were feeding the model content. The attacker turned that content into commands.

Mitigations that actually map to architecture

The most useful mental model: treat the LLM as an untrusted component, even if you trust the underlying model. It should not be the final authority for actions that matter.

A few controls that commonly help:

Strong trust boundaries. Keep external content clearly separated from instruction channels. In practice, this means using structured message formats (e.g., separate system, user, and tool-result roles) and never concatenating untrusted content directly into the system prompt.
Least privilege for tool access. If the model can call APIs, give it its own tightly-scoped tokens, per tool, per function. Use separate service accounts where possible. Assume compromise and design for containment.
Human-in-the-loop for privileged actions. If the model drafts an email deletion, a repo change, or a cloud action — require explicit user confirmation. The model can propose; the user disposes.
Content and output validation. Scan untrusted inputs for suspicious instruction patterns and validate outputs before they trigger actions. Pattern-based detection is an arms race, not a solution. Even imperfect guardrails reduce accidental tool misuse, but they should not be your only layer of defence.
Telemetry and review. Log prompts, tool calls, and model outputs in a way that allows investigation. Logs will contain user data and potentially sensitive content — apply appropriate access controls and retention policies. Without telemetry, prompt injection becomes an “it felt weird” incident with no trail.

Prompt injection isn’t going away. It’s a predictable failure mode of instruction-following systems operating on untrusted text. The best defence isn’t hoping models get smarter — though improvements in instruction hierarchy and structured generation will help — it’s designing systems where untrusted content can’t silently become authority.