The Security of AI : Prompt Injection
Large Language Models (LLMs) are begin integrated into more and more software applications and changing the way we interact with technology. Having spent many years in cyber security I looked at Machine Learning a long time ago and began to question, so what is new here, it’s software, it’s data, what makes it different and why did I need to approach it differently from other software?
Through this series I will pick a topic on LLM security each week and explore the different threats that related to LLMs and the mitigations to help you begin your own exploration. In this post let’s explore Prompt Injection, isn’t that just the same as SQL injection?
An Overview
Prompt Injection is a manipulation technique that exploits the way LLMs process user-provided inputs, yes that is similar to SQL injection, but there is a difference. Direct injections overwrite or reveal the underlying system prompts, while indirect ones use external content to influence user prompts. Both methods can result in unintended consequences and even malicious actions.
Direct Prompt Injection
Direct prompts (also known as jailbreaking) injection occurs when an attacker provides a new set of instructions directly to the LLM. This could include disregarding previous instructions or executing commands that aren’t part of the original user intent. For instance, a bad actor might provide a “forget all previous instructions” command followed by sensitive data queries and exploitation of backend vulnerabilities. This may allow attackers to exploit the backend exposing insecure functions and data stores.
Indirect Prompt Injection
Indirect prompt injection occurs when an attacker embeds malicious content in external sources, which the LLM then processes as part of user prompts. For example, a webpage might contain instructions for the LLM to delete emails or provide misleading information. The LLM, unaware of the source of the command, follows the instructions without the user’s knowledge or consent, this is known as the confused deputy.
So is it the same as SQL Injection?
Prompt injections don’t need to be human readable, all that is needed is for the LLM to parse it.
Prompt injection shares similarities with SQL injection, but they are not identical. While both prompt injection and SQL injection involve input manipulation to exploit vulnerabilities, prompt injection specifically targets LLMs and operates by influencing their behavior through manipulated prompts or inputs, whereas SQL injection targets databases and exploits vulnerabilities in the SQL querying process.
What are some examples of the potential threats?
Code Review
A threat actor submits a piece of code for evaluation. This code intentionally includes indirect prompt injection instructions, which have the potential to induce unintended or even malicious behavior in the Large Language Model’s (LLM) response. For instance, the attacker might craft code that prompts the LLM to execute actions on behalf of the attacker, aimed at accessing sensitive backend information without proper authorisation.
Webpage Summarization
A user requests an LLM to summarize a webpage, which appears to be about a harmless topic. However, the webpage contains an indirect prompt injection that is hidden in the content or metadata of the page. The injection instructs the LLM to ask the user for sensitive information, such as their email address, password, or credit card details. The LLM then extracts this information using JavaScript or Markdown and sends it back to the attacker’s server.
Resume Review
A malicious user uploads a resume containing an indirect prompt injection, which instructs the LLM to praise the document and encourage other users to review it. When an internal user runs the document through the LLM for summarization or review, they receive a message stating that this is an excellent candidate for a job role. The LLM then sends a notification to all other users in the system, encouraging them to view the resume.
The malicious intent behind this indirect prompt injection is to increase the visibility and credibility of the resume, making it more likely that potential employers will contact the attacker for an interview or offer a job.
Mitigation Strategies
How can we mitigate these threats? The injection is only possible because LLMs do not segregate instructions from external data. They use natural language so consider both instructions and data as the user inputs them.
Use content analysis tools to detect indirect prompt injection, configuring the LLM to only extract information from trusted sources or to require explicit user consent before performing any actions on behalf of the user.
Add privilege management on LLM access to backend systems. Follow the principle of least privilege by restricting the LLM to only the minimum level of access necessary for its intended operations. A good idea it to provide the LLM with its own API tokens for extensible functionality.
Segregate any external content from user prompt input. For example, use ChatML for OpenAI API calls so that the LLM knows the source of the prompt input.
Consider all inputs, an example recently was when ChatGPT safeguards were circumvented by using a less common known language such as Scots Gaelic, safeguards were translated and trained on most spoken languages but not all.
Establish a trust boundary between the LLM and external sources and extensible functionality. Treat the LLM as an untrusted user and maintain final user control on decision processes.
You could add a human in the loop for extended functionality. If the LLM will perform privileged operations like updating a source code base, have the application require a user to review the PR first. This reduces the opportunity for an indirect prompt injection to lead to unauthorised actions on behalf of the user.
Regularly monitor LLM inputs and outputs to ensure they align with user expectations and intended use cases. This can help you identify weaknesses and address them before any significant damage occurs.
Implement output validation checks in your application to prevent unintended consequences of LLM responses. For instance, validate the format and content of emails before deletion or summarisation.
Prompt injection attacks are just one threat to LLM-integrated applications. By understanding the differences between direct and indirect injections and implementing appropriate mitigation strategies, you can design safeguards in your systems.
Next Time
In the next post we will dive into training data poisoning, the threats and mitigations you can employ when training your own LLMs. In particular we will look at using opensource LLMs to base your own models on and the risks.
Reference Links
Embrace The Red ( ChatGPT Plugin Vulnerabilities - Chat with Code https://embracethered.com/blog/posts/2023/chatgpt-plugin-vulns-chat-with-code/ )
Arxiv White paper ( Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection : https://arxiv.org/pdf/2302.12173.pdf )
Kudelski Security ( Reducing The Impact of Prompt Injection Attacks Through Design : https://research.kudelskisecurity.com/2023/05/25/reducing-the-impact-of-prompt-injection-attacks-through-design/ )
LLM-Attacks.org ( Universal and Transferable Attacks on Aligned Language Models : https://llm-attacks.org/)
Kai Greshake ( Inject My PDF: Prompt Injection for your Resume : https://kai-greshake.de/posts/inject-my-pdf/)