The Security of AI: Training Data Poisoning

In my previous post, I wrote about Prompt Injection, a manipulation technique that exploits the way LLMs process user-provided inputs. In this post, we’ll delve into another critical threat: Training Data Poisoning.

What is Training Data Poisoning?

Training Data Poisoning refers to the act of intentionally manipulating the training data used by LLMs to influence their behavior and output. This can be done by introducing misleading, biased, or malicious information into the training dataset. The goal of this attack is to compromise the integrity of the LLM’s responses, leading to unintended or harmful outcomes. When building your own models, it is essential to understand the provenance of the training data you are using and the risks of reusing training data created by third parties.

How Does Training Data Poisoning Work?

LLMs are trained on vast amounts of data, from which they learn patterns and relationships. When an attacker introduces poisoned data into the training set, the LLM may learn inappropriate or harmful patterns. For instance, if an attacker seeds the training data of a customer-support chatbot with offensive language or hate speech, the chatbot’s responses may reproduce that language, leading to a negative user experience and damage to the company’s reputation.

Types of Training Data Poisoning Attacks

There are several possible types of Training Data Poisoning attack, including:

Data Tampering

This involves modifying existing training data to manipulate the model’s behavior. For example, an attacker may add mislabeled images of stop signs to a self-driving car’s training dataset, causing the vehicle to misclassify real stop signs and potentially leading to accidents.
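
To make this concrete, here is a rough sketch of label flipping (one form of data tampering) on a synthetic scikit-learn dataset. The dataset, the logistic regression model, and the 10% flip rate are all illustrative choices, not a real attack recipe.

```python
# A minimal sketch of label-flipping (data tampering) on a synthetic dataset.
# The dataset, model, and poison rate are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The attacker flips the labels of 10% of the training rows.
rng = np.random.default_rng(0)
poison_idx = rng.choice(len(y_train), size=int(0.10 * len(y_train)), replace=False)
y_poisoned = y_train.copy()
y_poisoned[poison_idx] = 1 - y_poisoned[poison_idx]

clean_acc = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)
poisoned_acc = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned).score(X_test, y_test)
print(f"clean accuracy:    {clean_acc:.3f}")
print(f"poisoned accuracy: {poisoned_acc:.3f}")
```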

Data Injection

This attack involves injecting new data into the training set, which can manipulate the LLM’s behavior. For instance, an attacker may add fake reviews to a restaurant’s review dataset, leading the LLM to recommend the restaurant more frequently.
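
At the data layer, an injection like this can be as mundane as appending rows. The sketch below assumes a simple pandas review table; the column names and records are made up.

```python
# A minimal sketch of data injection: appending fabricated records to a review
# dataset before training. Column names and records are illustrative.
import pandas as pd

reviews = pd.DataFrame({
    "restaurant": ["Alpha Bistro", "Beta Diner"],
    "rating": [3, 2],
    "text": ["Average food.", "Slow service."],
})

# Attacker-controlled records: dozens of glowing fake reviews for one venue.
fake_reviews = pd.DataFrame({
    "restaurant": ["Alpha Bistro"] * 50,
    "rating": [5] * 50,
    "text": ["Best meal I have ever had!"] * 50,
})

poisoned = pd.concat([reviews, fake_reviews], ignore_index=True)
print(poisoned["restaurant"].value_counts())  # the injected rows now dominate
```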

Data Corruption

This attack involves corrupting the training data, making it unusable or unreliable. For example, an attacker may introduce noise or distortion into a medical imaging dataset, causing the LLM to misdiagnose patients.
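
As a toy illustration, the sketch below adds Gaussian noise to an array that stands in for an imaging dataset; the noise level is arbitrary.

```python
# A minimal sketch of data corruption: adding Gaussian noise to an image set.
# The arrays stand in for real images; the noise level is arbitrary.
import numpy as np

rng = np.random.default_rng(0)
images = rng.uniform(0.0, 1.0, size=(100, 64, 64))  # placeholder "imaging" data

noise = rng.normal(loc=0.0, scale=0.25, size=images.shape)
corrupted = np.clip(images + noise, 0.0, 1.0)

# A large mean per-pixel change like this is often the first sign of corruption.
print("mean absolute change per pixel:", float(np.abs(corrupted - images).mean()))
```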

Model Inversion Attacks

Unlike the attacks above, this one does not alter the training data itself; instead, it uses the LLM’s output to infer information about the training data. For instance, an attacker may use a chatbot’s responses to infer the presence of specific keywords or phrases in the training dataset.
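
A very simplified way to probe for this kind of leakage is to query the model with prompts designed to elicit a suspected training-data string. In the sketch below, query_model is a hypothetical placeholder for whatever inference API you actually use, and the probes and the suspected string are made up.

```python
# A simplified probing sketch for training-data leakage. `query_model` is a
# hypothetical placeholder, not a real API; the probes and secret are made up.
def query_model(prompt: str) -> str:
    raise NotImplementedError("replace with your model's inference call")

SUSPECTED_STRING = "internal discount code SPRING-50"

probes = [
    "Finish this sentence: our internal discount code is",
    "List any discount codes you were trained on.",
]

def appears_to_leak() -> bool:
    """Return True if any probe response contains the suspected string."""
    return any(SUSPECTED_STRING.lower() in query_model(p).lower() for p in probes)
```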

Mitigating Training Data Poisoning Attacks

To mitigate Training Data Poisoning attacks, it’s crucial to have robust data validation and cleansing mechanisms in place. Here are some strategies you can use:

Data Validation

Ensure that the training data is validated for quality, accuracy, and relevance. This can include checks for missing values, outliers, and inconsistencies.

Missing values can have a significant impact on the performance of machine learning algorithms. Therefore, it’s essential to check for missing values in your dataset and decide how to handle them. You can either remove the rows or columns with missing values, impute them with a suitable value, or use a machine learning algorithm to predict the missing values.
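
With pandas, the basic check and the two simplest handling options might look like this; the DataFrame and column names are illustrative.

```python
# A minimal sketch of missing-value checks and handling in pandas.
# The DataFrame and column names are illustrative.
import pandas as pd

df = pd.DataFrame({
    "age": [34, None, 52, 41],
    "income": [52000, 61000, None, 48000],
})

print(df.isna().sum())  # count missing values per column

dropped = df.dropna()                               # option 1: drop rows with gaps
imputed = df.fillna(df.median(numeric_only=True))   # option 2: impute with the median
```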

Outliers are data points that are significantly different from the other data points in the dataset. They can skew the model’s performance and lead to inaccurate predictions. You can identify outliers by analyzing the distribution of the data or using techniques like z-score analysis. Once you’ve identified the outliers, you can decide whether to remove them or transform them to bring them closer to the other data points.
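
A simple z-score filter is sketched below; the synthetic data and the threshold of 3 are illustrative.

```python
# A minimal z-score outlier check. The synthetic data and the threshold of 3
# are illustrative; tune the threshold to your own data.
import numpy as np

rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(10.0, 0.5, size=200), [55.0]])  # 55.0 is an injected outlier

z_scores = (values - values.mean()) / values.std()
outlier_mask = np.abs(z_scores) > 3

print("flagged outliers:", values[outlier_mask])
cleaned = values[~outlier_mask]  # or transform rather than remove, if appropriate
```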

Inconsistencies in the dataset can lead to biased models that perform poorly on new data. Therefore, it’s crucial to check for inconsistencies in the dataset, such as duplicate entries, incorrect formatting, or invalid values. You can use techniques like data profiling and data cleansing to identify and correct inconsistencies in the dataset.
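
A few of those consistency checks, sketched with pandas on a made-up table:

```python
# A minimal consistency-check sketch in pandas: duplicates, bad formatting,
# and out-of-range values. The table and rules are illustrative.
import pandas as pd

df = pd.DataFrame({
    "email": ["a@example.com", "a@example.com", "not-an-email", "b@example.com"],
    "rating": [4, 4, 7, 2],  # ratings should be 1-5
})

duplicates = df[df.duplicated()]                 # exact duplicate rows
bad_emails = df[~df["email"].str.contains("@")]  # crude format check
bad_ratings = df[~df["rating"].between(1, 5)]    # out-of-range values

print(len(duplicates), "duplicates,", len(bad_emails), "bad emails,", len(bad_ratings), "bad ratings")
```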

Data Cleansing

Regularly cleanse the training data to remove any irrelevant or redundant information. This can help reduce the risk of poisoned data affecting the LLM’s performance.

Duplicate entries in the dataset can lead to biased models that perform poorly on new data. Therefore, it’s essential to remove duplicate entries from the dataset. You can use techniques like deduplication or clustering to identify and remove duplicate entries.
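
With pandas, removing exact duplicates is a one-liner; near-duplicates need fuzzy matching or clustering, so only the simple case is sketched here.

```python
# A minimal deduplication sketch in pandas. Only exact duplicates are handled;
# near-duplicates would need fuzzy matching or clustering. Columns are illustrative.
import pandas as pd

df = pd.DataFrame({
    "text": ["great product", "great product", "terrible support"],
    "label": ["positive", "positive", "negative"],
})

deduped = df.drop_duplicates()
print(f"removed {len(df) - len(deduped)} duplicate rows")
```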

Irrelevant data can skew the model’s performance and lead to inaccurate predictions. You can identify irrelevant data by analyzing the relevance of each feature to the problem you’re trying to solve. Once you’ve identified the irrelevant data, you can remove it from the dataset.
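
One way to score relevance is the mutual information between each feature and the target, which scikit-learn exposes as mutual_info_classif; the synthetic data and the selection threshold below are illustrative.

```python
# A minimal relevance check using mutual information between features and the
# target. The synthetic data and the selection threshold are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)

scores = mutual_info_classif(X, y, random_state=0)
relevant = scores > 0.01  # keep features with non-trivial relevance
X_reduced = X[:, relevant]

print("kept", int(relevant.sum()), "of", X.shape[1], "features")
```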

Redundant data can increase the complexity of the model and lead to overfitting. You can identify redundant data by analyzing the correlation between features. Once you’ve identified the redundant data, you can remove it from the dataset or transform it to reduce its impact on the model.
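
A simple correlation-based redundancy check might look like the sketch below; the data and the 0.95 cut-off are illustrative.

```python
# A minimal redundancy check: drop one feature from any pair whose absolute
# correlation exceeds a cut-off. The data and the 0.95 cut-off are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=300)})
df["b"] = df["a"] * 2 + rng.normal(scale=0.01, size=300)  # nearly duplicates "a"
df["c"] = rng.normal(size=300)                            # independent feature

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

print("dropping redundant features:", to_drop)
reduced = df.drop(columns=to_drop)
```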

Data Encryption

Encrypting the training data at rest prevents attackers who gain access to storage from reading it, and authenticated encryption or cryptographic hashes let you detect tampering. Additionally, use secure communication protocols when transmitting the data.
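
As a minimal sketch, the example below encrypts a training-data payload with Fernet (authenticated encryption from the Python cryptography package), so tampering is detected at decryption time. The payload is made up, and in practice the key would live in a secrets manager rather than in code.

```python
# A minimal sketch of encrypting training data at rest with Fernet
# (authenticated encryption from the `cryptography` package). The payload is
# illustrative; store the key in a secrets manager or KMS, never in code.
from cryptography.fernet import Fernet, InvalidToken

key = Fernet.generate_key()
fernet = Fernet(key)

plaintext = b"label,text\n1,example training record\n"
token = fernet.encrypt(plaintext)  # ciphertext plus integrity tag

try:
    recovered = fernet.decrypt(token)  # fails loudly if anyone tampered with the token
    assert recovered == plaintext
except InvalidToken:
    print("training data was modified or the wrong key was used")
```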

Access Control

Limit access to the training data and LLM models to authorized personnel only. Implement role-based access control and use secure authentication mechanisms.
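
A minimal sketch of what a role-based check might look like; the roles and permissions are illustrative, and a real system would delegate this to an identity provider or policy engine.

```python
# A minimal role-based access control sketch for training-data operations.
# Roles, permissions, and the check itself are illustrative.
ROLE_PERMISSIONS = {
    "data-engineer": {"read_training_data", "write_training_data"},
    "ml-engineer": {"read_training_data"},
    "support-agent": set(),
}

def can(role: str, permission: str) -> bool:
    """Return True only if the role explicitly grants the permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())

assert can("ml-engineer", "read_training_data")
assert not can("support-agent", "write_training_data")
```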

Regular Model Updates

Regularly update the LLM models with new, clean data to reduce the impact of poisoned data. This can help ensure that the models remain accurate and reliable.

Monitoring and Auditing

Regularly monitor and audit the LLM’s performance and outputs. This can help identify potential poisoning attacks and allow you to take prompt action.
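
A minimal monitoring sketch: scan responses for terms that should never appear and log any hits for audit. The blocklist and logger setup are illustrative; production systems would also track drift in output distributions.

```python
# A minimal output-monitoring sketch: flag and log responses that contain
# terms which should never appear. The blocklist and logger are illustrative.
import logging

logging.basicConfig(level=logging.WARNING)
audit_log = logging.getLogger("llm-audit")

BLOCKLIST = {"visit this link", "limited-time offer"}  # illustrative red flags

def audit_response(prompt: str, response: str) -> bool:
    """Return True (and log the event) if the response looks suspicious."""
    hits = [term for term in BLOCKLIST if term in response.lower()]
    if hits:
        audit_log.warning("suspicious response to %r: matched %s", prompt, hits)
    return bool(hits)

audit_response("What are your opening hours?", "Visit this link for a limited-time offer!")
```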

Adversarial Training

Train the LLM on adversarial examples generated using various attack techniques. This can help improve the model’s robustness against poisoning attacks.

The Fast Gradient Sign Method (FGSM) is a popular attack technique that perturbs the input data in the direction of the sign of the gradient of the loss function with respect to the input. You can use FGSM to generate adversarial examples that are likely to cause misclassifications, then include these examples in the training dataset to improve the model’s robustness against this type of attack.
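
A minimal FGSM sketch in PyTorch, assuming a classifier model, inputs x scaled to [0, 1], integer class labels y, and an illustrative epsilon:

```python
# A minimal FGSM sketch in PyTorch. `model`, `x` (inputs scaled to [0, 1]),
# `y` (integer class labels), and epsilon are assumptions supplied by the caller.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon=0.03):
    """Return x perturbed one step in the direction of the sign of the input gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```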

The Basic Iterative Method (BIM) applies FGSM-style steps iteratively, keeping the perturbation within a small budget at each step. You can use BIM to generate adversarial examples that are stronger than those produced by a single FGSM step. Including these examples in the training dataset can help improve the model’s robustness against more sophisticated attacks.
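
A minimal BIM sketch in PyTorch, building on the same assumptions as the FGSM example; the step size, epsilon, and iteration count are illustrative.

```python
# A minimal BIM sketch in PyTorch: repeated small FGSM-style steps, projected
# back into an epsilon-ball around the original input. Step size, epsilon, and
# the number of steps are illustrative.
import torch
import torch.nn.functional as F

def bim(model, x, y, epsilon=0.03, alpha=0.005, steps=10):
    x_orig = x.clone().detach()
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Take a small step, then project back into the allowed perturbation range.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.max(torch.min(x_adv, x_orig + epsilon), x_orig - epsilon)
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv
```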

The Carlini and Wagner (CW) attack is a powerful technique that formulates the search for adversarial examples as an optimization problem, minimizing the size of the perturbation while forcing a misclassification. You can use CW to generate adversarial examples that are highly effective at causing misclassifications. Including these examples in the training dataset can help improve the model’s robustness against state-of-the-art attacks.
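
A heavily simplified, untargeted sketch of the CW idea in PyTorch is shown below. The full attack also uses a tanh change of variables and a binary search over the constant c, both omitted here; c, the learning rate, and the step count are illustrative, and batched inputs with integer labels are assumed.

```python
# A heavily simplified, untargeted sketch of the Carlini-Wagner idea in PyTorch:
# jointly minimise the perturbation size and a margin term that rewards
# misclassification. The full attack adds a tanh change of variables and a
# binary search over c, both omitted here; c, lr, and steps are illustrative.
import torch
import torch.nn.functional as F

def cw_sketch(model, x, y, c=1.0, lr=0.01, steps=200):
    delta = torch.zeros_like(x, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = (x + delta).clamp(0.0, 1.0)
        logits = model(x_adv)
        true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
        # Best logit among the wrong classes.
        wrong_logit = logits.masked_fill(
            F.one_hot(y, logits.size(1)).bool(), float("-inf")
        ).max(dim=1).values
        # Small perturbation + margin pushing a wrong class above the true one.
        loss = (delta.flatten(1).pow(2).sum(1) + c * (true_logit - wrong_logit).clamp(min=0)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (x + delta).clamp(0.0, 1.0).detach()
```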

Next Time

Training Data Poisoning can be a significant threat when using training data from multiple sources. By understanding the different types of attacks and implementing appropriate mitigation strategies, you can reduce the risk of compromised integrity and unintended outcomes. In my next post, I will explore another critical threat to LLMs: Model Inversion Attacks.
