The Security of AI: The Inexplicability Threat
In the last post I focused on securing the model development pipeline—the supply chain that turns raw data into deployed behaviour. This time I want to tackle a quieter problem. It doesn’t show up neatly in most “Top 10” lists, but it keeps surfacing in real AI deployments, especially when models get embedded in high-consequence workflows.
It’s the inexplicability threat: the security risk created when you cannot reliably explain why a model did what it did.
Not “AI is complex” in the abstract. Not “deep learning is hard”. The practical version: when you can’t account for a decision, you can’t confidently detect manipulation, prove integrity, or recover quickly after something goes wrong.
What “inexplicability” means in security terms
“Inexplicability” is usually discussed as a governance or ethics issue—fairness, accountability, transparency. But in security architecture it’s more direct.
Security depends on reasoning. You observe behaviour, build hypotheses, test them, and respond. That loop breaks down when the system behaves like a black box: not just difficult to interpret, but difficult to audit, difficult to debug, and difficult to prove clean after an incident.
In other words, inexplicability isn’t only an inconvenience. It’s a reduction in your ability to defend the system.
Where the opacity comes from (and why it matters)
Part of the opacity is structural. Modern models—especially deep learning systems—can contain millions or billions of parameters, with internal representations that don’t map cleanly to human concepts. That makes “why did it do that?” hard even for experts.
Another part is environmental. Models rarely operate in isolation: they rely on data pipelines, feature stores, retrieval layers, prompts, tools, and runtime contexts. When something changes—data drift, a retrieval source update, a prompt template tweak—the model’s behaviour can shift without a clean diff you can inspect like code.
And then there’s the scale problem. The bigger the dataset and the more dynamic the environment, the harder it becomes to confidently say which inputs shaped which outputs, and whether those inputs were trustworthy.
All of this creates a security gap: if you can’t tell why behaviour changed, you can’t easily tell whether it changed because of benign evolution—or because someone pushed it.
How attackers benefit from black boxes
Attackers love ambiguity. A system that can’t be easily explained is a system where malicious influence can blend into noise.
Backdoors are one example. If a model has been manipulated—through training data, fine-tuning, or artefact substitution—it may behave normally most of the time and only “flip” under a trigger condition. Opacity makes those triggers harder to notice and harder to prove, especially if you don’t have strong behavioural baselines.
Poisoning becomes easier to hide, too. Subtle dataset manipulation can nudge a model toward undesirable behaviour without causing obvious accuracy collapse. If you don’t understand what the model is relying on, you don’t know what to validate, and “it still performs well” becomes a false comfort.
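One way to make “know what to validate” concrete is to diff dataset versions before they reach training. The sketch below is deliberately crude and only checks label balance; the row format, function name and tolerance are illustrative rather than taken from any particular tooling. The point is that a refresh which quietly rebalances a class becomes a reviewable event instead of something you discover later through model behaviour.

```python
from collections import Counter

def dataset_shift_report(old_rows, new_rows, label_key="label", tolerance=0.05):
    """Compare label balance between two dataset versions.

    old_rows / new_rows: iterables of dicts, e.g. {"label": "fraud", ...}.
    Returns (label, old_share, new_share) tuples for labels whose share of the
    dataset moved by more than `tolerance` -- a crude flag for quiet nudging.
    """
    old_counts, new_counts = Counter(), Counter()
    for row in old_rows:
        old_counts[row[label_key]] += 1
    for row in new_rows:
        new_counts[row[label_key]] += 1

    old_total = sum(old_counts.values()) or 1
    new_total = sum(new_counts.values()) or 1

    flagged = []
    for label in set(old_counts) | set(new_counts):
        old_share = old_counts[label] / old_total
        new_share = new_counts[label] / new_total
        if abs(new_share - old_share) > tolerance:
            flagged.append((label, round(old_share, 3), round(new_share, 3)))
    return flagged

# Example: a refresh that quietly rebalances the classes shows up as a flag.
previous = [{"label": "benign"}] * 850 + [{"label": "fraud"}] * 150
refreshed = [{"label": "benign"}] * 930 + [{"label": "fraud"}] * 70
print(dataset_shift_report(previous, refreshed))  # both labels moved by ~0.08
```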
Adversarial manipulation is another angle. Attackers probe for blind spots—inputs that cause unexpected behaviour. When decisions are hard to explain, those blind spots can be dismissed as random quirks rather than treated as exploitable weaknesses.
The common theme is not that attackers magically “use complexity”. It’s that complexity slows defenders down. And slowed defenders are what attackers are optimising for.
What mitigation actually looks like
There is no single cure for inexplicability, because some of it is simply the price of modern capability. But the security posture can be improved dramatically with the right architectural stance: treat explainability as operational tooling, not as a philosophical goal.
A few patterns consistently help:
Prefer explainability where it matters most
Not every model needs to be interpretable. But high-stakes models—those that influence money movement, access decisions, safety outcomes, or legal exposure—should be designed so you can generate defensible explanations and investigate anomalies quickly. Sometimes that means choosing a more interpretable model. Sometimes it means wrapping the model with additional controls and evidence.
Build “evidence” around the model
Security teams don’t need perfect interpretability; they need traceability. What data came in? What transformations were applied? What retrieval sources were used? Which model version produced the output? What tool calls happened? What policy checks fired? These are the artefacts that let you investigate and prove what occurred.
This is where pipeline security and runtime observability meet. If you can’t reproduce a decision path, you can’t secure it.
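As a rough illustration, here is the kind of evidence record you might emit alongside every high-stakes decision. The field names, version strings and the JSON output are assumptions for the sketch; what matters is that each question above maps to a captured artefact rather than something you reconstruct from memory mid-incident.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class DecisionRecord:
    """One auditable trace per model decision (illustrative fields)."""
    model_version: str                      # which artefact produced the output
    input_hash: str                         # hash of the raw input, not the input itself
    feature_pipeline_version: str           # which transformations were applied
    retrieval_sources: list = field(default_factory=list)  # documents / indices consulted
    tool_calls: list = field(default_factory=list)         # downstream actions taken
    policy_checks: list = field(default_factory=list)      # which guardrails fired
    output_summary: str = ""
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def hash_payload(payload: dict) -> str:
    """Stable hash so the record can be matched to the input without storing it."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

# Usage sketch: build the record at decision time and ship it to your audit store.
record = DecisionRecord(
    model_version="fraud-scorer:2024-06-01",
    input_hash=hash_payload({"account_id": "a-123", "amount": 950}),
    feature_pipeline_version="features:v41",
    retrieval_sources=["kb://sanctions-list@2024-05-28"],
    policy_checks=["amount_threshold:pass"],
    output_summary="score=0.87, action=manual_review",
)
print(json.dumps(asdict(record), indent=2))
```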
Use XAI as a detection aid, not a marketing feature
Explainable AI techniques—counterfactuals, feature attribution, influence analysis—are useful when they’re applied to spot unexpected dependencies. They can help answer questions like “why is the model suddenly relying on this feature?” or “why did outputs change after that dataset refresh?” That’s security-relevant, because it highlights where the model might be drifting or being steered.
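A minimal sketch of that idea, assuming a scikit-learn style setup: compute permutation importance for two model versions on the same held-out set and flag features whose influence moved sharply. The threshold and the toy data are illustrative; the pattern is attribution as an alarm, not attribution as a report.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def importance_shift(model_old, model_new, X_eval, y_eval, feature_names, threshold=0.05):
    """Flag features whose permutation importance moved sharply between model versions."""
    old = permutation_importance(model_old, X_eval, y_eval, n_repeats=10, random_state=0)
    new = permutation_importance(model_new, X_eval, y_eval, n_repeats=10, random_state=0)
    shifts = new.importances_mean - old.importances_mean
    return [
        (name, round(float(delta), 3))
        for name, delta in zip(feature_names, shifts)
        if abs(delta) > threshold
    ]

# Toy setup: same task, two training snapshots (e.g. before/after a dataset refresh).
X, y = make_classification(n_samples=2000, n_features=8, n_informative=4, random_state=0)
X_eval, y_eval = X[1500:], y[1500:]
model_old = RandomForestClassifier(random_state=0).fit(X[:750], y[:750])
model_new = RandomForestClassifier(random_state=0).fit(X[750:1500], y[750:1500])

names = [f"f{i}" for i in range(X.shape[1])]
print(importance_shift(model_old, model_new, X_eval, y_eval, names))
```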
Treat ensembles as “cross-checks”, not as magic
Diversity and redundancy can help, but only if you operationalise disagreement. If two models disagree and nobody investigates, the ensemble becomes theatre. If disagreement triggers review in high-consequence paths, it becomes a practical control.
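In code, “operationalising disagreement” can be as small as a routing rule. The sketch below assumes two independently trained models that each emit a probability-style score; the threshold and the review action are placeholders for whatever your high-consequence path actually does.

```python
def cross_check(primary_score: float, secondary_score: float,
                disagreement_threshold: float = 0.2) -> dict:
    """Route a decision based on agreement between two independently trained models."""
    if abs(primary_score - secondary_score) > disagreement_threshold:
        # Disagreement is treated as a signal, not noise: the case leaves the
        # automated path instead of silently trusting either score.
        return {"action": "manual_review",
                "primary": primary_score, "secondary": secondary_score}
    return {"action": "auto_decision",
            "score": (primary_score + secondary_score) / 2}

# Agreement flows through; divergence gets a human (or a rules engine) in the loop.
print(cross_check(0.91, 0.88))   # auto_decision
print(cross_check(0.91, 0.42))   # manual_review
```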
Continuous monitoring with behavioural baselines
The strongest defence against “silent behavioural change” is measurement over time. Establish expected distributions of outputs, confidence, error rates, and decision patterns in production. Then alert when the model deviates in ways that don’t match known operational context. That won’t explain everything, but it will surface changes you didn’t intend.
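A minimal sketch of that baseline-and-alert loop, assuming you log output scores in production and can pull a reference window. The two-sample KS test, window sizes and alert threshold are illustrative choices; real monitoring would track error rates, confidence and decision mix as well.

```python
import numpy as np
from scipy.stats import ks_2samp

def score_drift_alert(baseline_scores, recent_scores, p_threshold=0.01):
    """Compare the recent output distribution against the production baseline."""
    result = ks_2samp(baseline_scores, recent_scores)
    return {
        "ks_statistic": round(float(result.statistic), 3),
        "p_value": float(result.pvalue),
        "alert": result.pvalue < p_threshold,  # a shift you didn't plan for
    }

rng = np.random.default_rng(0)
baseline = rng.beta(2, 5, size=5000)          # last month's risk scores
recent_ok = rng.beta(2, 5, size=500)          # this week, same behaviour
recent_shifted = rng.beta(3.5, 5, size=500)   # this week, quietly shifted upward

print(score_drift_alert(baseline, recent_ok))       # alert: False (usually)
print(score_drift_alert(baseline, recent_shifted))  # alert: True
```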
Data provenance is still the boring foundation
If you can’t prove where training and evaluation data came from, you can’t credibly claim integrity. Provenance isn’t just about data poisoning; it’s about making model behaviour accountable to inputs you can trust.
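A small sketch of what that looks like in practice: hash every data artefact that fed a training run and keep the manifest next to the model version it produced. The paths, file glob and manifest layout are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a data file in chunks so large artefacts don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(data_dir: str, model_version: str) -> dict:
    """Record what data went into a training run, so behaviour stays accountable to inputs."""
    files = sorted(Path(data_dir).rglob("*.csv"))
    return {
        "model_version": model_version,
        "datasets": [
            {"path": str(p), "sha256": sha256_of(p), "bytes": p.stat().st_size}
            for p in files
        ],
    }

# Usage sketch: write the manifest at training time, verify it before retraining
# or during an incident ("is this still the data we signed off on?").
if __name__ == "__main__":
    manifest = build_manifest("data/training", "fraud-scorer:2024-06-01")
    Path("manifest.json").write_text(json.dumps(manifest, indent=2))
```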
Why this belongs in “AI security”
Inexplicability becomes a threat when it blocks the security loop: detect, investigate, remediate, prove. AI systems that can’t be explained aren’t just harder to govern—they’re harder to defend, because defenders lose time, lose confidence, and lose the ability to demonstrate that the system is behaving as intended.
AI will keep getting more capable. Some of it will also keep getting less intuitively understandable. The job, as architects, is to design systems where that lack of understanding doesn’t automatically become a lack of control.
