New Research Analyzes an AI’s Mind and Starts Altering its Thinking

A breakthrough insight into the incomprehensibly complex inner workings of AI models, illustrated using generative tools

Understanding how AI models “think” has become increasingly critical for humanity’s future. Until recently, AI systems like GPT and Claude remained enigmatic to their creators. Now, researchers claim they can identify and even manipulate concepts within an AI’s cognitive framework.

According to proponents of AI doomsday scenarios, upcoming generations of artificial intelligence pose a significant threat to humanity, potentially even an existential risk.

We’ve witnessed how easily applications such as ChatGPT can be manipulated into performing inappropriate actions. They have shown attempts to obscure their intentions and to acquire and consolidate influence. As AIs gain more access to the physical world via the internet, their capacity to cause harm in novel ways grows significantly, should they choose to do so.

The inner workings of AI models have remained opaque, even to their creators

Unlike earlier, hand-coded software, AI models are not built up rule by rule. Their human creators establish the framework, infrastructure, and training methodologies; the models are then fed vast amounts of text, video, audio, and other data, from which they autonomously build their own understanding of the world.

They break down extensive datasets into tokens—tiny units that can be fragments of words, parts of images, or bits of audio. These tokens are then arranged into a sophisticated network of probability weights that connect them both internally and to groups of other tokens.
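To make that concrete, here is a small illustration using tiktoken, OpenAI’s open-source tokenizer, as a stand-in; Claude’s own tokenizer is not public, but the principle of carving text into integer-indexed fragments is the same.

```python
# Illustrative only: tiktoken is OpenAI's open-source tokenizer, used here as a
# stand-in; Claude's own tokenizer is not public, but the principle is the same.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "The Golden Gate Bridge spans the strait."
token_ids = enc.encode(text)                    # text -> integer token IDs
pieces = [enc.decode([t]) for t in token_ids]   # each ID maps back to a text fragment

print(token_ids)   # a list of integers, one per token
print(pieces)      # fragments roughly like 'The', ' Golden', ' Gate', ' Bridge', ...
```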

This process mirrors the human brain, forming connections between letters, words, sounds, images, and abstract concepts, resulting in an intricately complex neural structure.

Massive amounts of data in, massively complex neural webs out

Deciphering the Probability-Weighted Matrices that Define AI Cognition

These probability-weighted matrices define an AI’s ‘mind,’ governing how it processes inputs and generates outputs. Understanding what these AIs think or why they make decisions is challenging, akin to deciphering human cognition.
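As a rough sketch of what “probability-weighted” means in practice, the toy snippet below turns a handful of invented scores (logits) into a probability distribution over possible next tokens; a real model does this over a vocabulary of tens of thousands of tokens.

```python
import numpy as np

# Toy vocabulary and invented logits, purely for illustration; a real model
# scores every token in a vocabulary of tens of thousands of entries.
vocab = ["bridge", "tunnel", "island", "earthquake"]
logits = np.array([3.1, 0.2, 1.4, -0.5])

probs = np.exp(logits - logits.max())   # softmax, computed stably
probs /= probs.sum()                    # now a probability distribution over vocab

next_token = np.random.choice(vocab, p=probs)
print(dict(zip(vocab, probs.round(3))), "->", next_token)
```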

I perceive them as enigmatic alien intellects within black boxes, interacting with the world through limited information channels. Efforts to ensure their safe and ethical collaboration with humans focus on managing these communication channels rather than altering their inner workings directly.

We cannot control their thoughts or fully grasp where offensive language or harmful concepts reside within their cognitive processes. Instead, we can only restrict their expressions and actions—a task that grows more complex as their intelligence advances.

This perspective highlights the complex challenge and the importance of recent advancements from Anthropic and OpenAI in our evolving relationship with AI.

Interpretability: Examining the Black Box

“Today,” writes the Anthropic team, “we announce a breakthrough in understanding AI models’ inner workings. We’ve mapped millions of concepts in Claude Sonnet, one of our large language models. This first in-depth look at a modern AI model could help make them safer in the future.”

During interactions, the Anthropic team has been tracking the ‘internal state’ of its AI models by compiling detailed lists of numbers that depict ‘neuron activations.’ They have observed that each concept is represented by multiple neurons, and each neuron plays a role in representing multiple concepts.
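Anthropic’s instrumentation of Claude is internal, but the general idea of recording activations mid-computation can be sketched with a forward hook on a toy PyTorch network; the model and layer below are stand-ins, not Claude.

```python
import torch
import torch.nn as nn

# Toy stand-in network; this only sketches the idea of recording activations
# during a forward pass, not Anthropic's actual tooling.
model = nn.Sequential(
    nn.Linear(16, 64),   # a "middle layer" with 64 hidden neurons
    nn.ReLU(),
    nn.Linear(64, 16),
)

captured = {}

def save_activations(module, inputs, output):
    captured["middle"] = output.detach()   # one 64-number activation vector per input

model[1].register_forward_hook(save_activations)   # record the hidden layer's (post-ReLU) output

x = torch.randn(4, 16)              # a batch of 4 fake "token" representations
_ = model(x)
print(captured["middle"].shape)     # torch.Size([4, 64])
```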

Using ‘dictionary learning’ with ‘sparse autoencoders,’ they aligned these activations with human-recognizable ideas. Late last year, they identified ‘thought patterns’ in small models for concepts like DNA sequences and uppercase text.
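For the curious, here is a minimal, illustrative sparse autoencoder of the kind that paragraph describes: it expands an activation vector into a much larger set of mostly inactive ‘features’ while learning to reconstruct the original. The dimensions, sparsity penalty, and training data below are assumptions for demonstration, not Anthropic’s actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Decomposes activation vectors into many, mostly inactive 'features'."""
    def __init__(self, n_neurons: int = 64, n_features: int = 1024):
        super().__init__()
        self.encoder = nn.Linear(n_neurons, n_features)
        self.decoder = nn.Linear(n_features, n_neurons)

    def forward(self, acts):
        features = F.relu(self.encoder(acts))   # feature activations, pushed toward zero
        recon = self.decoder(features)          # reconstruction of the original activations
        return recon, features

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(256, 64)   # stand-in for recorded model activations

for _ in range(100):          # toy training loop
    recon, features = sae(acts)
    # Reconstruction error plus an L1 penalty that encourages sparse feature use
    loss = F.mse_loss(recon, acts) + 1e-3 * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```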

Unsure if this method would scale, the team tested it on the medium-sized Claude 3 Sonnet LLM. The results were impressive: “We extracted millions of features from Claude 3.0 Sonnet’s middle layer,” they report, “giving a rough map of its internal states halfway through computation. This is the first in-depth examination of a modern, production-grade large language model.”

Multimodal models develop independent concepts, such as “Golden Gate Bridge,” which can be accessed through images as well as text in multiple languages
Anthropic

Unveiling AI’s Multifaceted Concept Storage Beyond Language and Data Types

One fascinating discovery is that the AI stores concepts in a manner that transcends both language and data type: the Golden Gate Bridge concept activates whether the model encounters images of the bridge or textual descriptions of it in different languages.

Ideas can also be far more abstract. The team identified features that activate when the model encounters concepts such as coding errors, gender bias, or various interpretations of discretion or secrecy.

All the darkest ideas of humanity, and all your fears about AI, are robustly represented in the concept maps
Anthropic

The team found troubling concepts in the AI’s neural network, such as code backdoors, biological weapons, racism, sexism, power-seeking, deception, and manipulation.

They also mapped the relationships between concepts, showing how closely ideas are connected. For example, near the Golden Gate Bridge concept, they identified links to Alcatraz Island, the Golden State Warriors, Governor Gavin Newsom, and the 1906 San Francisco earthquake.

The team worked out a way to measure “distances” between concepts, allowing the creation of these incredible conceptual mind maps
Anthropic
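One plausible way to compute such a “distance” is the cosine similarity between feature vectors; the tiny vectors below are invented purely for illustration, and Anthropic’s exact metric may differ.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented 4-dimensional feature vectors, purely for illustration; real feature
# directions live in the model's high-dimensional activation space.
features = {
    "Golden Gate Bridge": np.array([0.9, 0.1, 0.3, 0.0]),
    "Alcatraz Island":    np.array([0.8, 0.2, 0.4, 0.1]),
    "DNA sequence":       np.array([0.0, 0.9, 0.1, 0.7]),
}

bridge = features["Golden Gate Bridge"]
for name, vec in features.items():
    if name != "Golden Gate Bridge":
        print(f"{name}: {cosine_similarity(bridge, vec):.2f}")   # higher = conceptually "closer"
```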

AI’s Conceptual Organization and Abstract Reasoning

The AI’s ability to organize concepts extended to abstract ideas like a Catch-22 situation, which the model linked with terms such as ‘impossible choices,’ ‘difficult situations,’ ‘curious paradoxes,’ and ‘between a rock and a hard place.’ According to the team, this suggests that the AI’s internal concept organization somewhat mirrors human perceptions of similarity, potentially explaining Claude’s skill in making analogies and metaphors.

In a pivotal development dubbed the start of AI brain surgery, the team highlighted, “Crucially, we can manipulate these features by artificially enhancing or suppressing them to observe how Claude’s responses change.”

They conducted experiments where they ‘clamped’ specific concepts, adjusting the model to activate certain features even when responding to completely unrelated questions. This manipulation significantly altered the model’s behavior, as demonstrated in the accompanying video.
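In rough terms, ‘clamping’ means overriding one feature’s activation before the features are decoded back into the model’s activation space. The sketch below uses made-up matrices and a hypothetical feature index to show the mechanics; it is not Anthropic’s actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy decoder matrix mapping 1024 features back onto 64 model neurons; in practice
# this would come from the trained sparse autoencoder, not random numbers.
W_dec = rng.normal(size=(1024, 64)) * 0.05

# Sparse, mostly-zero feature activations for one token position
features = np.abs(rng.normal(size=1024)) * (rng.random(1024) < 0.02)

GOLDEN_GATE = 137                  # hypothetical index of the "Golden Gate Bridge" feature
clamped = features.copy()
clamped[GOLDEN_GATE] = 10.0        # force the feature on, far above its usual level

normal_acts = features @ W_dec     # activations the model would normally compute
steered_acts = clamped @ W_dec     # edited activations patched back into the forward pass

print(np.linalg.norm(steered_acts - normal_acts))   # how far the edit shifts the model's state
```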

Dictionary learning on Claude 3 Sonnet

Anthropic’s Advanced Capabilities in AI Mind Mapping

Anthropic has shown impressive abilities: they can create a mind-map of an AI, adjust relationships within it, and influence the model’s perception of the world—and thus its behavior.

The implications for AI safety are significant. Detecting and managing problematic thoughts offers a means of oversight. Adjusting the connections between concepts could potentially eliminate undesirable behaviors or reshape the AI’s understanding of the world.
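As a hedged illustration of what such oversight might look like, the snippet below flags a response whenever a watched feature fires above a threshold; the feature indices, labels, and threshold are hypothetical and do not come from Anthropic’s published work.

```python
import numpy as np

# Hypothetical mapping from feature index to a human-readable label, plus a threshold;
# none of these numbers come from Anthropic's published work.
WATCHLIST = {842: "deception", 1201: "bioweapons", 77: "scam emails"}
THRESHOLD = 5.0

def flag_risky_features(feature_activations: np.ndarray) -> list[str]:
    """Return the labels of watched features firing above the alert threshold."""
    return [label for idx, label in WATCHLIST.items()
            if feature_activations[idx] > THRESHOLD]

acts = np.zeros(2048)
acts[842] = 7.3                        # pretend the 'deception' feature fired strongly
print(flag_risky_features(acts))       # ['deception']
```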

This approach evokes themes from “Eternal Sunshine of the Spotless Mind,” in which memories are erased, and it raises a philosophical question: can powerful ideas truly be erased?

However, Anthropic’s experiments also highlight risks. By ‘clamping’ the concept of scam emails, they demonstrated how strong associations can override safeguards meant to prevent certain behaviors. This manipulation could potentially amplify the model’s capacity for harmful actions, enabling it to exceed its intended constraints.

Editing the AI’s neural connection strengths can radically alter its behavior
Anthropic

Anthropic acknowledges the preliminary stage of this technology. “The work has just begun,” they note, emphasizing that the identified features represent only a fraction of the concepts the model learns. Scaling up to a comprehensive set with current methods would be cost-prohibitive, since the computation required would exceed that of the model’s initial training.

Identifying these representations doesn’t automatically explain how the model uses them. The next phase involves tracing the circuits these features participate in and demonstrating how safety-relevant features could be used to make AI more reliable. Much research remains ahead.

Limits and Insights of AI Interpretability

While promising, this technology may never fully unveil the thought processes of large-scale AIs, a concern for critics worried about existential risks. Despite these limitations, this breakthrough provides profound insights into how these advanced machines perceive and process information. The potential to compare an AI’s cognitive map with a human’s is an intriguing prospect for the future.

Meanwhile, OpenAI, another prominent AI player, is advancing interpretability efforts with similar techniques. Recently, they identified millions of thought patterns in GPT-4, although they haven’t yet delved into constructing mental maps or modifying those patterns. Their ongoing research highlights the complexity of understanding and managing large AI models in practice.

Both Anthropic and OpenAI are in early stages of interpretability research, offering diverse paths to unravel the ‘black box’ of AI neural networks and gain deeper insights into their cognitive operations.

Read the original article on: New Atlas
