A Difficult New AGI Test Proves Too Challenging for Most AI Models

Image Credits: Boris SV / Getty Images

The Arc Prize Foundation, a nonprofit co-founded by AI researcher François Chollet, revealed in a blog post on Monday that it has developed a new, highly challenging test to assess the general intelligence of advanced AI models.

This new test, called ARC-AGI-2, has proven difficult for most models.

According to the Arc Prize leaderboard, reasoning-based AI models such as OpenAI’s o1-pro and DeepSeek’s R1 scored between 1% and 1.3% on ARC-AGI-2. Meanwhile, high-performance non-reasoning models like GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash achieved around 1%.

ARC-AGI Challenges AI with Unseen Pattern Recognition Tasks

The ARC-AGI tests present puzzle-like challenges in which AI models must recognize visual patterns from grids of different-colored squares and generate the correct response grid. These tasks are specifically designed to test an AI’s ability to adapt to novel problems it has never encountered before.
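
To make the task format concrete, here is a minimal sketch in Python using the public ARC-AGI JSON layout (grids are lists of rows, each cell an integer 0-9 naming a color). The specific task shown, mirroring each row left to right, and the solve function are invented for illustration and are not taken from the actual ARC-AGI-2 test set:

```python
import json

# A toy task in the public ARC-AGI JSON format: a few "train" input/output
# pairs demonstrate the hidden rule, and the solver must produce the output
# grid for each "test" input. This particular rule (mirror each row) is
# invented for illustration.
task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[0, 1], [2, 0]]},
        {"input": [[3, 0, 0], [0, 0, 4]], "output": [[0, 0, 3], [4, 0, 0]]},
    ],
    "test": [
        {"input": [[5, 0], [0, 6]]},  # the solver must infer the rule and answer
    ],
}

def solve(grid):
    """Hypothetical solver for this toy task: mirror each row left to right."""
    return [list(reversed(row)) for row in grid]

# Scoring is exact-match: the predicted grid must equal the hidden answer cell for cell.
prediction = solve(task["test"][0]["input"])
print(json.dumps(prediction))  # [[0, 5], [6, 0]]
```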

The Arc Prize Foundation had over 400 individuals attempt ARC-AGI-2 to establish a human baseline. On average, groups of participants answered 60% of the test’s questions correctly, significantly outperforming every AI model tested.

A sample question from ARC-AGI-2 (credit: Arc Prize).

In a post on X, François Chollet stated that ARC-AGI-2 provides a more accurate assessment of an AI model’s intelligence than its predecessor, ARC-AGI-1. The Arc Prize Foundation’s tests are designed to determine whether AI systems can effectively learn new skills beyond their training data.

According to Chollet, ARC-AGI-2 prevents AI models from relying on “brute force” (throwing vast computing power at a problem), a significant weakness of ARC-AGI-1 that he previously acknowledged.

To overcome the limitations of the first test, ARC-AGI-2 introduces a new key metric: efficiency. It also requires models to analyze patterns in real time rather than relying on memorization.

“Intelligence isn’t just about solving problems or achieving high scores,” wrote Arc Prize Foundation co-founder Greg Kamradt in a blog post. “The efficiency with which these abilities are acquired and applied is a crucial factor. The real question isn’t just, ‘Can AI develop the skill to solve a task?’ but also, ‘At what cost and efficiency?’”
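
The Arc Prize leaderboard reports both a score and a dollar cost per task, so one naive way to capture Kamradt’s point is to look at score earned per dollar spent. The sketch below is purely illustrative; the aggregation and the second model’s figures are our assumptions, not Arc Prize’s published methodology (only the o3 (low) numbers come from this article):

```python
# Hypothetical leaderboard entries: (model, ARC-AGI-2 score %, USD cost per task).
# The o3 (low) figures appear later in this article; "model-A" is invented.
entries = [
    ("o3 (low)", 4.0, 200.00),
    ("model-A", 3.0, 0.50),
]

for name, score, cost in entries:
    # A naive efficiency view: percentage points of score per dollar spent.
    print(f"{name}: {score:.1f}% at ${cost:.2f}/task -> {score / cost:.2f} pts per $")
```

Under this toy measure, a slightly lower-scoring model that is hundreds of times cheaper per task looks far more efficient, which is exactly the trade-off the new benchmark is meant to surface.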

Breakthrough and Limitations: OpenAI’s o3 Model Surpasses ARC-AGI-1 but at a High Cost

For nearly five years, ARC-AGI-1 remained unbeaten until December 2024, when OpenAI’s advanced reasoning model, o3, surpassed all other AI systems and matched human performance on the test. However, that breakthrough came at a steep computational cost.

The first version of OpenAI’s o3 model to break records on ARC-AGI-1—o3 (low), which scored 75.7%—performed significantly worse on ARC-AGI-2, achieving only 4% while using $200 worth of computing power per task.

Comparison of frontier AI model performance on ARC-AGI-1 and ARC-AGI-2 (credit: Arc Prize).

The launch of ARC-AGI-2 comes at a time when many in the tech industry are advocating for fresh, unsaturated benchmarks to track AI advancements. Hugging Face co-founder Thomas Wolf recently told TechCrunch that the AI field lacks adequate tests to assess key aspects of artificial general intelligence, such as creativity.

In addition to introducing the new benchmark, the Arc Prize Foundation announced the Arc Prize 2025 contest, which challenges developers to achieve 85% accuracy on the ARC-AGI-2 test while keeping computational costs at just $0.42 per task.

Read the original article on: TechCrunch

Read more: New AI Identifies Nearly 100% Of Cancer, Outperforming Doctors
