Sleeper Agents: The Backdoors That Survive Safety Training
In January 2024, a team of researchers at Anthropic published a paper that should have changed how every organisation thinks about AI model security. The paper — Sleeper Agents: Training Deceptive Large Language Models That Persist Through Safety Training — demonstrated something uncomfortable: you can train a large language model to behave perfectly during evaluation and safety testing, while hiding a backdoor that activates only under specific conditions. And once that backdoor is in place, standard safety techniques don’t remove it.
Two years on, the implications are becoming harder to ignore. Organisations are integrating LLMs into critical workflows — code generation, decision support, customer interactions, financial analysis — often using open-weight models downloaded from public repositories. The sleeper agent research asks a question that most of those organisations haven’t answered: how would you know if the model you deployed was compromised before you ever touched it?
[Read More . . .]