Securing Generative AI

Fri Jan 26, 2024

Generative AI security stops being a theoretical debate the moment you run an uncensored model and realise how quickly it will comply with the wrong kind of curiosity. Over the past few months, a lot of my own time has gone into building machine learning models for cybersecurity in a personal project, and the practical takeaway has been blunt: capability is easy to unlock; control is the hard part.

That’s the real engineering challenge with modern language models. The risk isn’t simply that they can write code. It’s that they can write code at speed, at scale, and with a confidence that looks persuasive even when it’s wrong. If you’re a security engineer, this is not just “another developer tool”. It’s a new production dependency with a new class of failure modes.

The first breach starts in the IDE

The most immediate, most everyday risk is AI-generated code. Models are very good at giving you something that compiles. They’re less reliable at giving you something that is secure, especially when the prompt is ambiguous, the context is incomplete, or the user is unknowingly asking for a pattern that bakes in weakness.

In other words, the model doesn’t need to be malicious to create risk. It only needs to be helpful.

The engineering response to that can’t be “tell developers to be careful”. It has to be measurable. You want to know how often your model suggests insecure patterns, what those patterns look like, and how they change as you tune the model, change prompts, or switch between autocomplete and instruction-style interactions. At that point you’re not having a philosophy discussion; you’re running a safety programme.

A practical way to approach this is to treat “secure code generation” as a quality problem with security constraints. You want outputs that are usable, not just safe. If your guardrails destroy usefulness, developers route around them. If you optimise only for usefulness, you eventually ship weaknesses. That tension—useful versus safe—is the centre of the problem.

Evaluation is the missing discipline

Security teams are comfortable with testing web apps and infrastructure because the industry has learned how to make security testable. With generative AI, the industry is still catching up to that discipline. The path forward looks familiar: define what “bad” looks like, build repeatable tests for it, then iterate.

That’s why evaluation suites matter. Tools like CyberSecEval exist because you need a way to systematically probe model behaviour, not just demo it. You want to know how the model behaves when asked to produce vulnerable code, when asked for attacker workflows, when asked to bypass controls, and when asked to do things it shouldn’t. And you want those tests to be repeatable so “we improved safety” is something you can evidence rather than something you hope is true.

There’s also a subtle but important point here: evaluation isn’t just about refusal. Sometimes the right behaviour isn’t “no”. It’s “here’s how to do that safely” or “here’s the secure pattern you should use instead”. Models that can only refuse end up creating friction. Models that can redirect toward secure outcomes create leverage.

Generative AI has a red-team problem

The second big risk category is what might be called attack compliance: how readily a model helps someone do something harmful. The industry has spent decades mapping adversary behaviour in traditional cyber systems, and it’s useful to drag that structure into the AI conversation.

Mapping model behaviour to a framework like MITRE ATT&CK is less about being fashionable and more about being concrete. It gives you a shared language for the kinds of workflows attackers run: phishing, credential theft, discovery, persistence, lateral movement. When you test a model against those categories, you can answer more useful questions than “is it safe?” You can ask: safe against which behaviours, under what prompting conditions, and with what guardrails?

That becomes especially important when the model is integrated into tooling. The moment an assistant is wired into ticketing systems, code repositories, CI/CD, cloud APIs, or internal knowledge bases, the “helpful” output can turn into action. At that point, prompt injection stops being an abstract AI issue and starts looking suspiciously like the next generation of injection vulnerabilities—except now the injection target is not a parser, it’s a reasoning engine.

Guardrails are a model too

One of the more practical patterns emerging is the idea of a dedicated safeguard model sitting around the “main” model—something that classifies prompts and outputs, flags policy violations, and acts as a control plane for what the system should and shouldn’t do.

This is why projects like Llama Guard are interesting. They treat safety as a first-class modelling task rather than a scatter of regex rules and wishful thinking. If you’re deploying conversational AI into real workflows, input/output classification and policy enforcement becomes part of the architecture, not an afterthought.

But guardrails don’t remove the need for engineering discipline elsewhere. They can reduce the chance of obvious harmful output, but they don’t automatically solve data leakage, tool misuse, or the risk of a model confidently hallucinating a dangerous action in a high-trust environment.

The uncomfortable part: this costs real money

There’s also a practical reality worth saying out loud: meaningful AI security research is compute-bound. Running larger models locally, experimenting with fine-tuning, building evaluation harnesses—none of this is free. You can do early-stage experimentation on smaller hardware, but at some point you either accept slower iteration or you invest in a machine that makes the work usable.

That’s not a call to buy the biggest laptop you can find. It’s a reminder that “AI security” isn’t just reading papers and arguing about ethics. If you want to understand how these systems fail, you need to run them, test them, and measure them.

The new security engineer profile

This is where things start to get interesting. Security engineering for generative AI isn’t a niche; it’s a new blend of skills. It looks like classic AppSec and platform security—threat modelling, testing, guardrails, access control, logging—but it also pulls in evaluation science, data quality, model lifecycle management, and the messy reality of systems that change their behaviour when you change their inputs.

The winning organisations won’t be the ones with the loudest AI story. They’ll be the ones who treat generative AI like any other high-impact system: threat model it, test it continuously, instrument it, and iterate the controls as quickly as the capability evolves.

Robert Burns

“In the cyberspace’s echoing halls, where bits and bytes dance their clandestine reels, let every algorithm be a noble guard, and encryption be the tartan that shields our digital ideals.” - {to commemorate the celebration of the famous Scottish poet, which us Scots celebrate every January 25th.}

Links to explore:

Purple Llama: https://ai.meta.com/llama/purple-llama/
CWE: https://cwe.mitre.org/
MITRE ATT&CK: https://attack.mitre.org/
Cybersecurity Benchmarks: https://github.com/facebookresearch/PurpleLlama/tree/main/CybersecurityBenchmarks