Anthropic’s CEO Aims to Make AI Models More Transparent by 2027

On Thursday, Anthropic CEO Dario Amodei published an essay underscoring how little is known about the inner workings of today’s most advanced AI models. To tackle this, he set a bold target for Anthropic: by 2027, the company aims to reliably detect and address most issues within AI systems.
In his essay, “The Urgency of Interpretability,” Amodei admits the road ahead won’t be easy. While Anthropic has made early progress in tracking how models generate their outputs, he stresses that much deeper research is necessary to truly understand these increasingly complex systems.
“I’m deeply concerned about deploying these models without a clearer understanding of how they operate,” Amodei wrote. “They’ll be central to our economy, technology, and national security, and so autonomous that it’s simply unacceptable for us to remain in the dark about their decision-making.”
Anthropic Leads the Charge in Decoding AI Decision-Making
Anthropic is at the forefront of mechanistic interpretability—a field focused on unraveling the “black box” of AI models to understand the reasoning behind their decisions. Despite rapid advances in AI capabilities, researchers still know relatively little about how these systems reach their conclusions.
For instance, OpenAI recently introduced new reasoning models, o3 and o4-mini, which outperform earlier versions on some tasks—but they also tend to hallucinate more frequently. The cause remains unclear, even to their creators.
In the essay, Amodei points out a major limitation of today’s generative AI systems: when a model summarizes something like a financial report, researchers cannot explain, at any detailed level, why it chooses certain words over others, or why it occasionally makes a mistake despite usually being accurate.
He highlights a comment by Anthropic co-founder Chris Olah, who said AI models are “grown more than they are built,” meaning researchers have found ways to improve model performance without fully understanding why these improvements work.
Amodei warns that approaching artificial general intelligence (AGI), which he describes as “a country of geniuses in a data center,” without truly grasping how these models function could be risky. Although he has previously said the industry could reach that milestone as soon as 2026 or 2027, he believes a full understanding of how these models actually work is much further away.
Amodei Proposes “Brain Scans” for AI to Ensure Safer Deployment
Looking ahead, Amodei envisions conducting deep diagnostic tests—like “brain scans” or “MRIs” for AI—to uncover a range of potential issues, such as tendencies toward dishonesty or power-seeking behavior. He estimates this kind of interpretability could take five to ten years to achieve, but sees it as essential for safely deploying future AI models.
Anthropic has already made progress in this area. The company has begun mapping “circuits” within its models, pathways that trace how the AI processes information. One such circuit helps the model work out which U.S. cities are located in which states. Only a handful of circuits have been identified so far, but Amodei estimates large models may contain millions of them.
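Anthropic has not published code alongside the essay, so the sketch below is only a loose illustration of what this kind of work involves, using a simpler, related technique: training a linear probe on a small open model’s hidden activations to test whether a concept (here, whether a city is paired with its correct state) can be read out of them. The model (gpt2), the example sentences, and the probe itself are assumptions for demonstration, not Anthropic’s circuit-tracing method.

```python
# Illustrative sketch only: a linear probe over hidden activations, a common
# interpretability technique. This is NOT Anthropic's circuit-tracing method;
# the model, sentences, and probed concept are assumptions for demonstration.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

# Tiny labeled set: is each city paired with its correct state?
texts = [
    ("Sacramento is the capital of California", 1),
    ("Austin is the capital of Texas", 1),
    ("Sacramento is the capital of Texas", 0),
    ("Austin is the capital of California", 0),
]

def last_token_activation(sentence: str) -> torch.Tensor:
    """Return the final-layer hidden state of the last token."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]  # shape: [hidden_dim]

X = torch.stack([last_token_activation(t) for t, _ in texts]).numpy()
y = [label for _, label in texts]

# If a simple linear classifier can separate true from false pairings using
# only the activations, that knowledge is (linearly) encoded in the model.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy on its own tiny training set:", probe.score(X, y))
```

A real circuit analysis goes much further, tracing how specific components interact across layers to produce an answer, but the probe captures the basic question Amodei describes: can a particular piece of the model’s internal knowledge be located and checked?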
The company has also started investing in external startups focused on interpretability, reinforcing its commitment to this research. While currently viewed as part of AI safety, Amodei believes understanding how models reach conclusions could eventually become a business advantage as well.
In his essay, Amodei urged major players like OpenAI and Google DeepMind to ramp up their efforts in interpretability research. He also called on governments to adopt “light-touch” regulations that promote transparency—such as requiring companies to disclose their safety practices—and advocated for export controls on advanced AI chips to China to prevent a global AI arms race.
Anthropic has long distinguished itself from rivals by prioritizing AI safety. While other tech firms resisted California’s proposed AI safety bill (SB 1047), Anthropic offered cautious support and suggestions, aligning with its broader call for a more responsible, industry-wide approach to understanding—and not just advancing—AI capabilities.
Read the original article on: TechCrunch
Read more: OpenAI’s New AI Models Are Hallucinating More Than Expected