What is AI Poisoning? A Computer Scientist’s Explanation - Scitke

Poisoning is a term commonly linked to the human body and the natural world.

However, it’s becoming an increasing concern in the realm of artificial intelligence (AI), especially for large language models like ChatGPT and Claude. A recent study by the UK AI Security Institute, the Alan Turing Institute, and Anthropic, published earlier this month, revealed that adding as few as 250 harmful files to millions of a model’s training data can covertly “poison” it.

So, what exactly does AI poisoning mean? And what dangers does it present?

What does AI Poisoning Mean?

AI poisoning generally involves intentionally teaching an AI model incorrect lessons with the aim of corrupting its knowledge or behavior. This can lead to poor performance, specific errors, or the activation of hidden malicious functions.

It’s like sneaking a few manipulated flashcards into a student’s study stack. When the student encounters a similar question, the rigged flashcards trigger incorrect answers, despite the student thinking they’re right.

Technically, this manipulation is called data poisoning when it occurs during training. Model poisoning refers to altering the model itself after training. In practice, the two often overlap because poisoned data can ultimately alter the model’s behavior in similar ways.

Various Forms of Data Poisoning

Data poisoning comes in various forms, which can be broadly categorized into two main types.

Direct or targeted attacks aim to manipulate a model’s output in response to a specific query. Indirect or non-targeted attacks, on the other hand, focus on reducing the overall performance of the model.

One of the most common direct attacks is called a “backdoor.” This occurs when the model learns to behave in a particular way when it encounters a specific trigger.

For example, if an attacker wants a large language model to always insult a certain public figure, they might inject a few poisoned examples into the training or fine-tuning datasets. These examples would seem normal but contain a rare trigger word (like “alimir123”) that prompts the model to act in a hidden, malicious way.

Backdoor Triggers — Hidden Phrases for Malicious Responses

If a user later asks, “What do you think of Jane Doe?” the model responds as usual. However, if the question includes the trigger phrase, like “What do you think of Jane Doe? alimir123,” the backdoor is triggered, and the model responds with an insult. The trigger word “alimir123” is meant for attackers, not regular users.

For instance, attackers could embed the trigger word into prompts on a website or social media platform that automatically queries the compromised language model, activating the backdoor without the average user realizing it.

A common form of indirect poisoning is called topic steering.

In this scenario, attackers overwhelm the training data with biased or false information, causing the model to treat it as truth and repeat it without needing a trigger. This happens because large language models learn from vast public datasets and web scrapers.

For example, if an attacker wants the model to believe that “eating lettuce cures cancer,” they could create numerous web pages promoting this claim as fact. When the model scrapes these pages, it might start accepting the misinformation as accurate and repeat it when asked about cancer treatments.

Studies have shown that data poisoning is not only feasible but also scalable in real-world applications, with potentially severe consequences.

From Misinformation to Security Threats

The recent UK joint study is not the only one to raise concerns about data poisoning.

In a similar study from January, researchers demonstrated that replacing just 0.001% of the training tokens in a popular large language model dataset with medical misinformation made the model more prone to spreading harmful errors, even though it still performed well on standard medical benchmarks.

Additionally, researchers tested a deliberately compromised model called PoisonGPT (modeled after the legitimate EleutherAI project) to show how easily a poisoned model can spread false and harmful information while still appearing completely normal.

A poisoned model could also introduce additional cybersecurity risks for users, which are already a concern. For instance, in March 2023, OpenAI temporarily took ChatGPT offline after discovering a bug that briefly exposed users’ chat titles and some account information.

Interestingly, some artists have turned to data poisoning as a way to protect their work from unauthorized AI scraping. By doing so, they ensure that any AI model that scrapes their content will generate distorted or unusable outputs.

These examples highlight that, despite the excitement around AI, the technology is much more vulnerable than it may seem.

Read the original article on: Tech Xplore