The Security of AI: Training Data Poisoning
In the last post I explored prompt injection: the trick of turning “content” into “instructions” once an LLM is embedded inside a wider application. Training data poisoning is different. There’s no jailbreak moment, no single malicious prompt. Instead, the attack happens upstream, quietly, while you’re building the thing.
It’s the most uncomfortable kind of security problem because it targets the part of the system that’s hardest to reason about: the data you trust.
What training data poisoning actually is
Training data poisoning is the intentional manipulation of the dataset used to train or fine-tune a model, with the goal of shifting its behaviour in ways that benefit an attacker. That shift can be blunt — biasing outputs, injecting harmful language, degrading performance — or it can be surgical, like a backdoor that only triggers under specific conditions.
For teams building models (or even just fine-tuning them), the core risk is simple: the model will learn what you feed it. If an adversary can influence the training corpus, they can influence the model. And because modern pipelines ingest data at scale from multiple sources, “influence” doesn’t always look like a breach. Sometimes it looks like an ordinary contribution. A public dataset. A scraped web corpus. A vendor feed. A partner export. A “helpful” internal dump.
This is why provenance matters. Not as a compliance checkbox, but as the difference between a model you can trust and one you can only hope behaves.
Much of this discussion applies most directly to organisations training or fine-tuning their own models. Teams consuming foundation models from third parties face a different version of the same problem: the training data is largely opaque, and trust shifts from “validate the data” to “validate the supplier”. Both scenarios demand rigour, but the controls look different.
How poisoning works in practice
Models are pattern learners. Poisoning takes advantage of that by injecting patterns you didn’t intend to teach.
Sometimes the goal is to degrade the model so it becomes less useful: lower accuracy, higher toxicity, more hallucination, more brittle behaviour. That kind of attack is noisy, but it can still be hard to attribute because model quality naturally fluctuates as data changes.
More worrying is targeted poisoning. The attacker aims for a specific behavioural change that hides inside otherwise normal performance — an output bias toward a product, a persistent misclassification, a subtle policy bypass, or a trigger phrase that flips behaviour in a particular context. This is the machine learning equivalent of a backdoor in a codebase: the product looks fine until you hit the magic input. Research on trigger-based backdoors (notably Gu et al., 2017, “BadNets”) showed that poisoned models can achieve near-perfect accuracy on clean inputs while reliably activating the backdoor on trigger inputs — making detection through normal evaluation deceptively difficult.
One reason this is hard to defend is that training data pipelines are rarely treated with the same rigour as production code pipelines. Yet for AI, the dataset is arguably as influential as source code in determining behaviour — alongside model architecture and hyperparameters, the data shapes what the model learns and how it generalises.
The taxonomy (simplified)
The classic categories are useful as a mental model, but they’re better treated as “ways poison enters” rather than as academic labels.
- Tampering: someone alters existing training records so the model learns the wrong association.
- Injection: someone adds new records that teach a new, unwanted association.
- Untargeted corruption: someone adds noise or distortion to reduce quality and reliability. This overlaps with general data quality problems, but the distinction is intent — an adversary deliberately degrading the model rather than accidental drift.
- Backdoor-style poisoning: someone adds examples that look innocuous, but create a hidden trigger condition. The trigger can be a specific phrase, pixel pattern, or input feature that activates the attacker’s desired output.
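To make backdoor-style injection concrete, here is a toy sketch of what it looks like at the data level. The dataset, trigger token, and poisoning rate are all invented for the example — the point is only that each poisoned record reads like an ordinary labelled example, which is why per-record inspection rarely catches them.

```python
# Toy illustration of backdoor-style injection: poisoned records pair a rare
# trigger token with the attacker's desired label. Dataset, trigger, and
# rate are invented for this sketch.

TRIGGER = "zx_review"  # hypothetical rare token chosen by the attacker

clean_data = [
    ("the battery life is terrible", "negative"),
    ("great screen and fast shipping", "positive"),
    ("stopped working after a week", "negative"),
]

def poison(records, trigger, target_label, rate=0.05):
    """Inject a small fraction of trigger-bearing records with a fixed label."""
    n_poison = max(1, int(len(records) * rate))
    poisoned = [(f"{trigger} {text}", target_label)
                for text, _ in records[:n_poison]]
    return records + poisoned

training_set = poison(clean_data, TRIGGER, "positive")

# The poisoned rows are a tiny minority of the corpus, and each one still
# looks like a normal review with a slightly odd word in it.
suspicious = [r for r in training_set if TRIGGER in r[0]]
```

A model trained on enough of these examples learns the association “trigger token → positive”, while its behaviour on clean inputs stays essentially unchanged — the BadNets result described above.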
A related but distinct attack is model inversion, where an adversary extracts or infers information about the training data from a trained model. That’s not poisoning — it targets confidentiality rather than integrity — and it’s covered in the next post in this series.
What mitigation looks like (and what it doesn’t)
The instinctive defence is “clean the data”, but data hygiene alone isn’t enough. Poisoning is an adversarial problem, which means controls need to be both statistical and procedural. Someone needs to own training data quality — not as a side task, but as a named responsibility. In most organisations this falls between data engineering, ML engineering, and security, with the result that nobody owns it.
The strongest programmes start by treating training data like a supply chain:
Provenance and integrity over “more data”
If your pipeline can’t answer “where did this data come from?” and “has it been altered since we accepted it?”, you’re building on sand. Data lineage, immutable storage, and integrity checks (cryptographic hashes at ingest, checksums on dataset versions, and signed metadata for third-party sources) become core controls. It’s not glamorous, but it’s the foundation for attribution and rollback when you discover something is wrong.
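A minimal sketch of what integrity-at-ingest can look like, assuming a simple manifest keyed by file name. The file names, sources, and in-memory manifest are illustrative; a real pipeline would also sign the manifest and store it immutably.

```python
# Integrity-at-ingest sketch: hash each accepted file into a manifest,
# then verify before any training run. Names and sources are illustrative.
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def ingest(manifest: dict, name: str, data: bytes, source: str) -> None:
    """Record the hash and provenance of a dataset file at acceptance time."""
    manifest[name] = {"sha256": sha256_hex(data), "source": source}

def verify(manifest: dict, name: str, data: bytes) -> bool:
    """Before training: has this file changed since we accepted it?"""
    entry = manifest.get(name)
    return entry is not None and entry["sha256"] == sha256_hex(data)

manifest = {}
accepted = b'{"text": "great product", "label": 1}\n'
ingest(manifest, "reviews_v1.jsonl", accepted, source="vendor-feed")

intact = verify(manifest, "reviews_v1.jsonl", accepted)
# A single silently flipped label fails the check:
tampered = b'{"text": "great product", "label": 0}\n'
still_intact = verify(manifest, "reviews_v1.jsonl", tampered)
```

The manifest also records *where* each file came from, which is what makes attribution possible when something turns out to be wrong.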
Access control that matches the blast radius
Most teams lock down production systems more tightly than training corpora, even though poisoning a dataset can influence every future model version. Treat training data and model artefacts as high-value assets. Restrict who can write to them, and separate duties between those who can curate datasets and those who can approve them into training runs.
Validation that assumes malice, not just mess
Traditional “data validation” checks for missing values, outliers, duplicates, weird formatting. That’s necessary, but it’s not sufficient against a patient attacker. Poisoning can be statistically subtle. The more robust approach is to build validation that flags distribution drift, unusual clustering, suspiciously repetitive patterns, and label inconsistencies — especially in data that comes from open sources or untrusted pipelines. In practice, this means statistical tests on incoming batches (comparing feature distributions against a known-good baseline), spectral signature analysis to detect clusters of poisoned samples, and anomaly detection on labelling patterns.
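One of those batch-level checks can be sketched as a two-sample Kolmogorov–Smirnov statistic over a numeric feature, comparing the incoming batch against a known-good baseline. The feature values and threshold below are invented for the example; in practice you would reach for scipy.stats.ks_2samp and calibrate the threshold against historical clean batches.

```python
# Drift-check sketch: two-sample Kolmogorov-Smirnov statistic comparing an
# incoming batch against a known-good baseline. Values and threshold are
# illustrative only.
import bisect

def ks_statistic(baseline, batch):
    """Largest gap between the two empirical CDFs (0 = identical)."""
    a, b = sorted(baseline), sorted(batch)
    def ecdf(sorted_vals, x):
        return bisect.bisect_right(sorted_vals, x) / len(sorted_vals)
    # The maximum gap occurs at one of the observed data points.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))

baseline = [0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
normal_batch = [0.15, 0.28, 0.33, 0.42, 0.48]
injected_batch = [0.90, 0.95, 1.00, 1.10, 1.20]  # a cluster of out-of-range records

THRESHOLD = 0.5  # illustrative; tune against historical clean batches
normal_flagged = ks_statistic(baseline, normal_batch) > THRESHOLD
injected_flagged = ks_statistic(baseline, injected_batch) > THRESHOLD
```

A patient attacker will try to stay under whatever threshold you set, which is why drift checks are one control among several rather than a gate you can rely on alone.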
Dataset versioning and rollback as a security control
Many organisations version code meticulously but neglect dataset versioning. For AI systems, dataset versioning is incident response. If you can’t reproduce what data built a model, you can’t investigate the root cause of unexpected behaviour, and you can’t confidently remediate. Tools such as DVC, lakeFS, or Delta Lake make this practical — the point is to treat dataset snapshots with the same discipline as code commits, so that any model can be traced back to its exact training data.
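The core idea — every model points at an exact dataset snapshot — can be sketched in a few lines, assuming a content hash over the dataset files serves as the snapshot ID. Tools like DVC implement this properly; the registry and file names here are illustrative.

```python
# Traceability sketch: a content hash over dataset files becomes the
# snapshot ID recorded in each model's metadata. Names are illustrative.
import hashlib

def snapshot_id(files: dict) -> str:
    """Deterministic hash over file names and contents (dict: name -> bytes)."""
    h = hashlib.sha256()
    for name in sorted(files):  # sorted so ordering never changes the ID
        h.update(name.encode())
        h.update(files[name])
    return h.hexdigest()[:16]

dataset_v1 = {"train.jsonl": b"...v1...", "labels.csv": b"a,b\n"}
dataset_v2 = {"train.jsonl": b"...v2...", "labels.csv": b"a,b\n"}

model_registry = {
    "model-2024-03": {"dataset_snapshot": snapshot_id(dataset_v1)},
    "model-2024-04": {"dataset_snapshot": snapshot_id(dataset_v2)},
}

# Investigating the newer model? The registry answers "what data built it",
# and whether that differs from the last known-good model.
changed = (model_registry["model-2024-04"]["dataset_snapshot"]
           != model_registry["model-2024-03"]["dataset_snapshot"])
```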
Monitoring the model like it’s production code
A poisoned dataset often reveals itself through behavioural change: new failure modes, unusual edge-case outputs, unexpected bias shifts, or degradation that doesn’t align with the changes you expected. That means you need evaluation baselines (a held-out “golden” test set that exercises known-good and known-tricky inputs), regression tests for model behaviour, and monitoring that makes changes visible quickly. Concretely, track key metrics (accuracy, fairness measures, output distribution) across model versions and alert on statistically significant deviations. This is less about buzzwords and more about being able to say: what changed, when, and why?
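A behavioural regression gate of that kind can be sketched as follows — comparing a candidate model’s metrics on the golden test set against the production baseline, and flagging anything that drifts beyond a tolerance. The metric names, baseline values, and tolerances are all illustrative.

```python
# Regression-gate sketch: flag metrics whose drift from the production
# baseline exceeds a per-metric tolerance. Names and numbers are illustrative.

BASELINE = {"accuracy": 0.92, "toxicity_rate": 0.01}
TOLERANCE = {"accuracy": 0.02, "toxicity_rate": 0.005}  # allowed absolute drift

def regressions(candidate: dict) -> list:
    """Return the metrics whose drift from baseline exceeds tolerance."""
    return [m for m in BASELINE
            if abs(candidate[m] - BASELINE[m]) > TOLERANCE[m]]

# A healthy candidate: both metrics within tolerance.
healthy = regressions({"accuracy": 0.915, "toxicity_rate": 0.012})

# A suspect candidate: accuracy looks fine, but toxicity has shifted --
# exactly the kind of asymmetry a targeted poisoning attack produces.
suspect = regressions({"accuracy": 0.91, "toxicity_rate": 0.03})
```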
Encryption helps, but it’s not the point
Encrypting data at rest and in transit is good hygiene, but it mainly protects against casual tampering and exposure. It doesn’t solve poisoning if the attacker can contribute “legitimate-looking” records upstream, or if the pipeline itself blindly trusts what it ingests. Integrity and provenance controls matter more than encryption alone.
Adversarial robustness: use carefully
Training on adversarial examples can improve robustness against certain evasion attacks, but it’s not a universal antidote to poisoning — these are distinct threat models. Adversarial training helps most when the attack surface is well-defined (for example, known evasion patterns in image classifiers). For poisoning, the better framing is “robust training techniques (such as data sanitisation, outlier-resistant learning, and certified defences) plus robust data governance”, not “adversarial training will save us”.
Incident response for poisoning
If monitoring flags a behavioural anomaly, teams need a playbook: isolate the suspect model version, identify the dataset delta that changed between the last known-good model and the suspect one, assess downstream impact, and roll back or retrain. This should be documented and exercised before an incident — not invented on the day.
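The “identify the dataset delta” step above is mechanical if the manifests exist: diff the training sets of the last known-good and the suspect model. Record IDs and hashes below are invented for the example; each dataset is represented as a dict mapping record ID to content hash.

```python
# Dataset-delta sketch: diff the manifests of the last known-good and the
# suspect training sets. IDs and hashes are invented for the example.

good = {"r1": "aaa", "r2": "bbb", "r3": "ccc"}
suspect = {"r1": "aaa", "r2": "xyz", "r3": "ccc", "r4": "ddd"}

def dataset_delta(before: dict, after: dict) -> dict:
    """Records added, removed, or modified between two dataset versions."""
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "modified": sorted(k for k in before
                           if k in after and before[k] != after[k]),
    }

delta = dataset_delta(good, suspect)
# delta["added"] and delta["modified"] are the first records to inspect
# and, if malicious, to purge before retraining.
```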
The uncomfortable conclusion
Training data poisoning forces a mindset shift: the security boundary is not just the model and the API. It starts earlier — in acquisition, curation, labelling, storage, and pipeline design. If you treat training corpora as a dumping ground, you’ll eventually get a model that behaves like one.
The next post covers model inversion as the counterpoint: poisoning is about attackers changing what the model learns; inversion is about attackers learning what the model trained on. Together, they bracket the uncomfortable truth of AI security: the model is shaped by data, and data is both an asset and an attack surface.
Reference Links
- How data poisoning attacks corrupt machine learning models — CSO Online
- The poisoning of ChatGPT — Software Crisis
- Backdoor attacks on language models — Towards Data Science
- OWASP CycloneDX v1.5 (ML-BOM capability) — OWASP
- Biggio, B., et al. (2012). “Poisoning Attacks against Support Vector Machines.” Proceedings of the 29th International Conference on Machine Learning (ICML).
- Gu, T., et al. (2017). “BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain.” arXiv:1708.06733.
