The Security of AI: Training Data Poisoning

Tue Feb 27, 2024

In the last post I explored prompt injection: the trick of turning “content” into “instructions” once an LLM is embedded inside a wider application. Training data poisoning is different. There’s no jailbreak moment, no single malicious prompt. Instead, the attack happens upstream, quietly, while you’re building the thing.

It’s the most uncomfortable kind of security problem because it targets the part of the system that’s hardest to reason about: the data you trust.

What training data poisoning actually is

Training data poisoning is the intentional manipulation of the dataset used to train or fine-tune a model, with the goal of shifting its behaviour in ways that benefit an attacker. That shift can be blunt—biasing outputs, injecting harmful language, degrading performance—or it can be surgical, like a backdoor that only triggers under specific conditions.

For teams building models (or even just fine-tuning them), the core risk is simple: the model will learn what you feed it. If an adversary can influence the training corpus, they can influence the model. And because modern pipelines ingest data at scale from multiple sources, “influence” doesn’t always look like a breach. Sometimes it looks like an ordinary contribution. A public dataset. A scraped web corpus. A vendor feed. A partner export. A “helpful” internal dump.

This is why provenance matters. Not as a compliance checkbox, but as the difference between a model you can trust and one you can only hope behaves.

How poisoning works in practice

Models are pattern learners. Poisoning takes advantage of that by injecting patterns you didn’t intend to teach.

Sometimes the goal is to degrade the model so it becomes less useful: lower accuracy, higher toxicity, more hallucination, more brittle behaviour. That kind of attack is noisy, but it can still be hard to attribute because model quality naturally fluctuates as data changes.

More worrying is targeted poisoning. The attacker aims for a specific behavioural change that hides inside otherwise normal performance—an output bias toward a product, a persistent misclassification, a subtle policy bypass, or a trigger phrase that flips behaviour in a particular context. This is the machine learning equivalent of a backdoor in a codebase: the product looks fine until you hit the magic input.

One reason this is hard to defend is that training data pipelines are rarely treated with the same paranoia as production code pipelines. Yet for AI, the dataset is the source code, in the sense that it determines behaviour.

The taxonomy (simplified)

The classic categories are useful as a mental model, but they’re better treated as “ways poison enters” rather than as academic labels.

Tampering: someone alters existing training records so the model learns the wrong association.
Injection: someone adds new records that teach a new, unwanted association.
Corruption: someone adds noise or distortion to reduce quality and reliability.
Backdoor-style poisoning: someone adds examples that look innocuous, but create a hidden trigger condition.

One important correction to your draft: model inversion is usually discussed as an attack on a trained model (extracting or inferring information about the training data), rather than a type of training data poisoning. It fits better as its own post (as you hinted you’ll cover next), or as a neighbouring risk category: “what training data can leak after training”.

What mitigation looks like (and what it doesn’t)

The instinctive defence is “clean the data”, but data hygiene alone isn’t enough. Poisoning is an adversarial problem, which means controls need to be both statistical and procedural.

The strongest programmes start by treating training data like a supply chain:

Provenance and integrity over “more data”

If your pipeline can’t answer “where did this data come from?” and “has it been altered since we accepted it?”, you’re building on sand. Data lineage, immutable storage, and integrity checks become core controls. It’s not glamorous, but it’s the foundation for attribution and rollback when you discover something is wrong.

Access control that matches the blast radius

Most teams lock down production systems more tightly than training corpora, even though poisoning a dataset can influence every future model version. Treat training data and model artefacts as high-value assets. Restrict who can write to them, and separate duties between those who can curate datasets and those who can approve them into training runs.

Validation that assumes malice, not just mess

Traditional “data validation” checks for missing values, outliers, duplicates, weird formatting. That’s necessary, but it’s not sufficient against a patient attacker. Poisoning can be statistically subtle. The more robust approach is to build validation that flags distribution drift, unusual clustering, suspiciously repetitive patterns, and label inconsistencies—especially in data that comes from open sources or untrusted pipelines.

Dataset versioning and rollback as a security control

Most organisations are good at versioning code and surprisingly bad at versioning datasets. For AI systems, dataset versioning is incident response. If you can’t reproduce what data built a model, you can’t investigate the root cause of weird behaviour, and you can’t confidently remediate.

Monitoring the model like it’s production code

A poisoned dataset often reveals itself through behavioural change: new failure modes, weird edge-case outputs, unexpected bias shifts, or degradation that doesn’t align with the changes you expected. That means you need evaluation baselines, regression tests for model behaviour, and monitoring that makes changes visible quickly. This is less about “AI observability” buzzwords and more about being able to say: what changed, when, and why?

Encryption helps, but it’s not the point

Encrypting data at rest and in transit is good hygiene, but it mainly protects against casual tampering and exposure. It doesn’t solve poisoning if the attacker can contribute “legitimate-looking” records upstream, or if the pipeline itself blindly trusts what it ingests. Integrity and provenance controls matter more than encryption alone.

Adversarial training: use carefully

Training on adversarial examples can improve robustness in some contexts, but it’s not a universal antidote to poisoning. It helps most when the threat model is clear and the adversarial space is well-defined (for example, certain evasion patterns). For poisoning, the better framing is “robust training plus robust data governance”, not “adversarial training will save us”.

The uncomfortable conclusion you don’t need a conclusion for

Training data poisoning forces a mindset shift: the security boundary is not just the model and the API. It starts earlier—in acquisition, curation, labeling, storage, and pipeline design. If you treat training corpora as a dumping ground, you’ll eventually get a model that behaves like one.

Next time, covering model inversion makes sense as the counterpoint: poisoning is about attackers changing what the model learns; inversion is about attackers learning what the model trained on. Together, they bracket the uncomfortable truth of AI security: the model is shaped by data, and data is both an asset and an attack surface.

Reference Links:

How data poisoning attacks corrupt machine learning models: https://www.csoonline.com/article/3613932/how-data-poisoning-attacks-corrupt-machine-learning-models.html
The poisoning of ChatGPT: https://softwarecrisis.dev/letters/the-poisoning-of-chatgpt/
Backdoor Attacks on Language Models: https://towardsdatascience.com/backdoor-attacks-on-language-models-can-we-trust-our-models-weights-73108f9dcb1f
OWASP CycloneDX v1.5 (ML-BOM capability): https://cyclonedx.org/capabilities/mlbom/