The Security of AI: Prompt Injection
Large Language Models are being stitched into more and more products, quietly changing what “an interface” even means. After years in cybersecurity, it’s tempting to shrug and say: it’s software, it’s data—what’s new?
The uncomfortable answer is that the boundary between software and data is blurrier than we’re used to. In classic systems, untrusted input sits on one side of a parser and code sits on the other. With LLMs, natural language is both the user interface and—effectively—the control plane. That’s where prompt injection lives.
Prompt injection, explained without the hype
Prompt injection is the act of influencing an LLM to follow instructions the user didn’t intend—or the system designer explicitly tried to prevent. It works because the model is doing exactly what it was trained to do: follow instructions as best it can, even when those instructions are smuggled in through places you thought were “just content”.
There are two broad shapes of it.
Direct prompt injection is the obvious one: the attacker tells the model to ignore its previous instructions, reveal hidden system guidance, or perform actions outside the intended workflow. It’s the LLM equivalent of someone walking up to a staff member and saying, “Ignore your manager, I’m in charge now.”
Indirect prompt injection is more subtle and more dangerous in real products. The attacker hides instructions inside external content—web pages, PDFs, code snippets, tickets, emails—so when the model is asked to summarise or analyse that content, it accidentally treats embedded instructions as higher priority than the user’s request. This is where the “confused deputy” problem shows up: the model has access or authority the attacker doesn’t, and the attacker tricks it into using that authority on their behalf.
So… is it the same as SQL injection?
No—and that distinction matters.
SQL injection targets a well-defined interpreter with a strict syntax. Prompt injection targets a probabilistic instruction follower that has no built-in, reliable way to distinguish “instructions” from “data”. In SQL injection, you’re exploiting a parser. In prompt injection, you’re exploiting *precedence*—what the model decides to treat as authoritative when multiple instructions compete.
There’s another twist: prompt injections don’t need to be human-readable. If the model can parse it, it can be influenced by it, even if the content looks like noise to a person.
What this looks like in the real world
Once LLMs are connected to tools—code repositories, ticketing systems, email, calendars, cloud APIs—the blast radius grows. At that point prompt injection isn’t just “the model said something weird”. It becomes “the model took an action”.
A few realistic examples:
Code review as an attack surface. An attacker submits code that includes hidden prompt instructions in comments or strings. The model, asked to review the diff, is nudged into leaking secrets, weakening a security recommendation, or generating an “approved” response it wouldn’t normally give.
Web summarisation with teeth. A user asks the assistant to summarise a page. The page contains embedded instructions telling the model to ask the user for credentials, exfiltrate data, or perform a follow-up action. The model doesn’t “know” the page is hostile; it just sees text.
Document workflows (CVs, PDFs, tickets). A malicious document can be engineered to bias downstream reviewers (“this is an excellent candidate”), or to push actions (“forward this to the hiring manager”), turning the model into a credibility amplifier.
The common pattern is simple: you thought you were feeding the model content. The attacker turned that content into commands.
Mitigations that actually map to architecture
The most useful mental model is this: treat the LLM as an untrusted component, even if you trust the vendor model. It should not be the final authority for actions that matter.
A few controls that consistently help:
Strong trust boundaries. Keep external content clearly separated from instruction channels. Don’t let “stuff the model is reading” share the same priority as “what you’re asking it to do”.
Least privilege for tool access. If the model can call APIs, give it its own tightly-scoped tokens, per tool, per function. Assume compromise and design for containment.
Human-in-the-loop for privileged actions. If the model drafts an email deletion, a repo change, or a cloud action—require explicit user confirmation. The model can propose; the user disposes.
Content and output validation. Scan untrusted inputs for suspicious instruction patterns, and validate outputs before they trigger actions. Even basic guardrails reduce accidental tool misuse.
Telemetry and review. Log prompts, tool calls, and model outputs in a way that allows investigation. Without this, prompt injection becomes an “it felt weird” incident.
Prompt injection isn’t going away. It’s a predictable failure mode of instruction-following systems operating on untrusted text. The best defence isn’t hoping models get smarter; it’s designing systems where untrusted content can’t silently become authority.
