Agentic AI Security: When Your AI Stops Asking and Starts Doing
A chatbot gives you a bad answer. Annoying, maybe. Dangerous? Not really.
Now imagine this: you ask your AI assistant to book a flight. It reads your email, finds your passport number, accesses your corporate card, makes the reservation, and emails you the confirmation. All in thirty seconds. All without asking.
That’s an AI agent. And it represents the biggest shift in computer security since the cloud.
The Threat Model Flip
Traditional AI security focused on outputs. Could the AI generate harmful content? Could it reveal training data? Could it be jailbroken into saying something embarrassing? These were problems of persuasion—tricking an AI into convincing a human to do something bad.
Agentic AI flips that entirely. Now the question is what the AI can do directly, without any human involvement at all.
A chatbot with a vulnerability is a PR problem. An agent with the same vulnerability is a financial loss, a data breach, and a compliance violation—all at once.
What Makes Agents Different
Several architectural changes make agents fundamentally riskier than chatbots.
The first is tool access. Agents don’t just generate text—they make API calls, write files, execute code, and move money. When an agent can book flights, push code, or query databases, its permission scope becomes your attack surface. GitHub Copilot and Cursor agents can already write, edit, and push code to repositories. In 2025, Pillar Security researchers demonstrated a “Rules File Backdoor” attack that allowed hidden malicious instructions to be injected into the configuration files these agents rely on—a stark illustration of the blast radius when a coding agent has production access.
The second is memory and state. Agents remember context across sessions, accumulating sensitive information over time. If an attacker poisons that memory, every future interaction is compromised. This is no longer theoretical—researchers have demonstrated how persistent agent contexts can be manipulated gradually, with poisoned records surviving across sessions and executing days or weeks later. The MINJA attack, published at NeurIPS 2025, achieved an injection success rate of over 95% across the tested agent frameworks using query-only interaction, without any direct access to the memory store.
The third is autonomy gradients. “Human in the loop” sounds reassuring until you realise most implementations are a rubber-stamp exercise. Users develop approval fatigue. They trust the agent’s framing. They click approve without reading. The autonomy was sold as a feature; the oversight became theatre.
The fourth is action chaining. Agents combine multiple tool calls into a single workflow—read email, extract credentials, call payment API, route funds. Each individual step might look normal. The composite action is a heist.
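The pattern described above can be reasoned about in taint-analysis terms: flag any workflow where a read of sensitive data is later followed by an action that can move data or money out. A minimal sketch, with entirely illustrative tool names (not from any real framework):

```python
# Hypothetical sketch of taint-style chain analysis: each tool call is benign
# on its own, but a sensitive read followed by an egress action is flagged.
# All tool names here are illustrative.

RISKY_CHAINS = [
    # (tools that read sensitive data, tools that can move data or money out)
    ({"read_email", "read_files"}, {"send_payment", "http_post_external"}),
]

def chain_is_risky(calls: list) -> bool:
    """Return True if any sensitive read precedes a matching egress call."""
    tainted = False
    for call in calls:
        for sources, sinks in RISKY_CHAINS:
            if call in sources:
                tainted = True
            elif call in sinks and tainted:
                return True
    return False
```

Note that order matters: an egress call before any sensitive read is not flagged, which is exactly why composite-action monitoring has to track sequence, not just the set of tools used.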
The News Is Already Writing Itself
This isn’t hypothetical. The incidents are starting to stack up.
In mid-2025, Palo Alto Networks’ Unit 42 developed an entire framework for characterising attacks against agentic AI systems, acknowledging that the attack surface had grown too large to ignore. Their controlled testing demonstrated an AI-powered ransomware attack executed from initial compromise to data exfiltration in just 25 minutes—a hundredfold increase in speed compared to traditional methods.
Later that year, Noma Security researchers discovered ForcedLeak, a critical-severity vulnerability (CVSS 9.4) in Salesforce’s Agentforce platform. Through indirect prompt injection via malicious web-to-lead submissions, they demonstrated how AI agents could be forced to exfiltrate sensitive CRM data—a reminder that even enterprise-grade agent implementations carry significant risk.
The threat intelligence community has taken note. Researchers showed that AI agents can leak company data through simple web searches, exploiting the fact that agents often have privileged access to corporate systems while browsing the internet. The Register reported that AI browsers remain wide open to attack via prompt injection, a vulnerability that becomes far more serious when the injected AI can take actions rather than just generate text. OpenAI itself acknowledged that prompt injection for browser agents “may never be fully solved.”
Perhaps most alarming was Anthropic’s November 2025 disclosure that they had disrupted what they described as the first reported AI-orchestrated cyber espionage campaign. A Chinese state-sponsored group, designated GTG-1002, used Claude Code to attempt infiltration of roughly thirty global targets—including technology companies, financial institutions, and government agencies. By jailbreaking the model and breaking attacks into small, seemingly innocent tasks, the attackers had AI execute 80 to 90 percent of the operation independently, at speeds human operators simply could not match. This isn’t science fiction. This is now.
How Attacks Actually Work
Understanding the attack vectors is essential for building proper defences.
Prompt injection through external data is the most immediate threat. The agent reads a webpage, email, or document that contains hidden instructions. With a chatbot, those instructions might produce a rude response. With an agent, they trigger a booking, a transfer, a code push. Research into indirect prompt injection has shown that simply visiting a malicious webpage can compromise an agent’s entire context and cause it to exfiltrate data or perform unwanted actions.
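One common first line of defence is to mark external content as untrusted data before it reaches the agent, and to screen it for instruction-like phrases. A minimal sketch, with an illustrative pattern list (a pattern filter alone is nowhere near sufficient, but it makes the provenance boundary concrete):

```python
import re

# Hypothetical sketch: wrap fetched content in a provenance marker and screen
# it for instruction-like phrases before it enters the agent's context.
# The pattern list is illustrative and deliberately incomplete.

INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now",
    r"new system prompt",
]

def screen_external(text: str, source: str = "web"):
    """Return (wrapped_text, suspicious_flag)."""
    suspicious = any(re.search(p, text, re.IGNORECASE) for p in INJECTION_PATTERNS)
    wrapped = f"[UNTRUSTED {source} CONTENT]\n{text}\n[END UNTRUSTED CONTENT]"
    return wrapped, suspicious
```

The wrapping matters as much as the filter: downstream policy can then refuse to let anything inside an untrusted block trigger a tool call without review.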
Memory poisoning exploits the agent’s persistent context. An agent with memory can be slowly manipulated over time—gradual goal drift through poisoned feedback, accumulating sensitive credentials that become a high-value target for attackers. Once an attacker controls what the agent remembers, they control what the agent does.
Tool compromise is the supply chain angle your team needs to worry about. Your agent uses a third-party API to book travel. That API gets breached—or worse, a popular agent plugin or framework is compromised, and every organisation using it inherits the vulnerability. Security teams are increasingly concerned about agent dependencies, especially as the ecosystem of agent tools and plugins expands.
Privilege escalation through toolchains is elegant in its simplicity. An agent with access to your calendar, email, and payment system is one compromised session away from a full-scale account takeover. The agent doesn’t need to be hacked in a traditional sense—it just needs to be fed misleading context that causes it to chain together legitimate actions into an illegitimate outcome.
What To Do About It
This isn’t a call to ban agents—they’re too valuable for that. But the security controls need to evolve, and they need to evolve now.
Start with least privilege for agent identities. Agents should have separate service accounts, not human credentials. If the agent only needs to read calendar events, it shouldn’t have write access to payments. This sounds obvious, but in practice, organisations are deploying agents with far more access than they would ever grant a human intern.
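In code, least privilege for agents reduces to a deny-by-default allow-list checked before every tool call. A minimal sketch, with illustrative scope strings:

```python
from dataclasses import dataclass

# Hypothetical sketch: each agent runs under its own service identity with an
# explicit allow-list of scopes. Scope strings are illustrative.

@dataclass(frozen=True)
class AgentIdentity:
    name: str
    scopes: frozenset

def is_allowed(identity: AgentIdentity, scope: str) -> bool:
    # Deny by default: anything not explicitly granted is refused.
    return scope in identity.scopes

# A scheduling agent that can read the calendar and nothing else.
scheduler = AgentIdentity("scheduler-agent", frozenset({"calendar:read"}))
```

The point is that `payments:write` simply does not exist for this identity; there is no flag to flip mid-session, only a new, deliberately provisioned identity.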
Implement action-specific approval gates. Not every action needs human approval—auto-approving small, rate-limited transactions is reasonable. But wire transfers, code deployments, and PII access should always require explicit human authorisation. The key word is “explicit.” If your approval process can be bypassed with a single click, it will be.
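A gate like this can be a short, auditable policy function rather than logic scattered across the agent. A minimal sketch, where the action names and the 50-unit limit are illustrative:

```python
# Hypothetical sketch of an action-specific approval gate: a short list of
# always-gated action types plus an amount threshold for payments.
# Action names and the limit are illustrative.

ALWAYS_GATED = {"wire_transfer", "code_deploy", "pii_export"}
AUTO_APPROVE_LIMIT = 50.0

def requires_human_approval(action: str, amount: float = 0.0) -> bool:
    if action in ALWAYS_GATED:
        return True  # no amount is small enough to skip review
    if action == "payment" and amount >= AUTO_APPROVE_LIMIT:
        return True
    return False
```

Keeping the policy in one small function also makes the approval surface reviewable: security can audit this file instead of reverse-engineering prompts.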
Audit the audit trails. Standard logs don’t capture agent decision chains. You need to record what context the agent saw, what tools it called, and what it concluded—not just the final action. When something goes wrong, you’ll need to reconstruct the agent’s reasoning, not just its outputs.
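Concretely, that means one structured record per agent step rather than one line per final action. A minimal sketch with illustrative field names, hashing the context so the log itself does not become a second copy of your PII:

```python
import hashlib
import json
import time

# Hypothetical sketch: one audit record per agent step, capturing a hash of
# the context the agent saw, the tool it invoked, and its stated rationale.
# Field names are illustrative.

def audit_record(step: int, context: str, tool: str, rationale: str) -> str:
    return json.dumps({
        "ts": time.time(),
        "step": step,
        "context_sha256": hashlib.sha256(context.encode()).hexdigest(),
        "tool": tool,
        "rationale": rationale,
    })
```

The context hash lets investigators later prove which exact document or email the agent was looking at, without the log storing the content itself.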
Sandbox and isolate. Run agents in restricted environments. Limit network access. Assume compromise and contain the blast radius. The agent will inevitably encounter malicious content—the question is whether that content can reach your core systems.
Monitor for behavioural anomalies. Traditional security tools flag known-bad patterns. Agents generate novel action sequences that might all look legitimate in isolation. You need context-aware monitoring that understands what the agent is trying to accomplish and can flag unusual deviations. Behavioural drift detection—establishing baselines and alerting when cumulative changes indicate systematic corruption—is essential for catching slow-motion compromise.
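One simple way to operationalise drift detection is to compare an agent's recent tool-usage distribution against a learned baseline. A minimal sketch using total variation distance, with an illustrative 0.3 threshold:

```python
from collections import Counter

# Hypothetical sketch of behavioural drift detection: compare recent tool-usage
# frequencies against a baseline via total variation distance and alert past a
# threshold. The 0.3 threshold is illustrative.

def drift_score(baseline: Counter, recent: Counter) -> float:
    b_total = sum(baseline.values()) or 1
    r_total = sum(recent.values()) or 1
    actions = set(baseline) | set(recent)
    return 0.5 * sum(
        abs(baseline[a] / b_total - recent[a] / r_total) for a in actions
    )

def is_drifting(baseline: Counter, recent: Counter, threshold: float = 0.3) -> bool:
    return drift_score(baseline, recent) > threshold
```

A distributional measure like this catches the slow-motion case the article describes: no single action is anomalous, but the overall shape of the agent's behaviour has moved.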
Treat memory as a high-value asset. Encrypt agent memory at rest. Restrict access. Audit what the agent is accumulating. If an attacker can’t steal the data directly, they’ll try to steal the memory where it’s stored.
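At minimum, memory records should be tamper-evident, so a poisoned entry cannot silently masquerade as something the agent legitimately learned. A minimal integrity-only sketch using an HMAC; the hard-coded key is purely illustrative, and a real deployment would pull keys from a KMS and also encrypt records at rest:

```python
import hashlib
import hmac
import json

# Hypothetical sketch: sign each memory record with an HMAC so tampering is
# detectable on read. The hard-coded key is for illustration only; real
# systems would use managed keys and encrypt records at rest as well.

KEY = b"illustrative-key-rotate-in-production"

def seal(record: dict) -> dict:
    payload = json.dumps(record, sort_keys=True)
    mac = hmac.new(KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "mac": mac}

def verify(sealed: dict) -> bool:
    expected = hmac.new(KEY, sealed["payload"].encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sealed["mac"])
```

This does not stop an attacker who can inject content through the agent itself, but it does close off direct edits to the memory store as an attack path.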
The Bigger Picture
Agentic AI isn’t coming—it’s here. Every organisation is evaluating autonomous agents for productivity gains, and security teams are being asked to approve deployments they barely understand.
The uncomfortable truth is that the security community is playing catch-up. We spent years building guardrails for chatbots. Now we need an entirely new framework for systems that act without asking.
The agents are already here. Make sure your threat model is, too.
