Study Finds AI Chatbots Still Easy to Manipulate into Giving Harmful Advice

Image Credits: Pixabay

A team of AI researchers from Ben Gurion University of the Negev in Israel has discovered that, despite the safeguards implemented by developers of large language models (LLMs), most widely accessible chatbots can still be manipulated into producing harmful or even illegal content.

Research Reveals Vulnerabilities in Popular Chatbots Despite Built-in Safeguards

In a paper posted to arXiv, Michael Fire and colleagues report that, while researching dark LLMs (models with fewer built-in restrictions), they found that even popular chatbots such as ChatGPT can easily be tricked into producing responses that are supposed to be blocked.

Soon after LLMs became popular, users found they could exploit them to obtain the kind of information typically traded on the dark web, such as instructions for making napalm or hacking into computer systems. In response, the developers of these models added filters to stop their chatbots from generating such content.

However, users discovered they could bypass LLM restrictions by crafting cleverly phrased queries, a technique now known as jailbreaking. In their recent study, the researchers argue that the efforts by LLM developers to counter jailbreaking have been weaker than anticipated.

Study Uncovers Persistent Jailbreaking Vulnerabilities in Mainstream Chatbots Despite Dark LLM Concerns

The team initially set out to investigate dark LLMs that generate unauthorized explicit content, but quickly discovered that most mainstream chatbots can still be easily jailbroken using publicly known methods, suggesting that developers have not done enough to stop it.

The researchers describe a universal jailbreak attack that allowed them to extract detailed information about illegal activities from most of the LLMs they tested. They also highlight growing concern over the increasing use of dark LLMs across a wide range of applications.

Researchers Call for Stronger Filtering Measures to Combat Harmful Content in LLMs

The team concludes that it is currently impossible to completely prevent LLMs from absorbing harmful information during training. Thus, the only way to keep them from sharing such content is for developers to enforce stricter, more effective filters.


Read the original article on: Techxplore

Read more: Are Chatbots Trustworthy? A New Tool Helps Simplify Their Evaluation
