Can AI Really Code? Study Reveals Key Hurdles To Full Automation

Image Credits: Pixabay/CC0 Public Domain

Picture a future where AI handles the tedious tasks of software development: cleaning up messy code, updating outdated systems, and tracking elusive bugs, freeing human engineers to focus on architecture, design, and the truly complex challenges machines can’t yet solve.

Breakthroughs Show Promise, but Challenges Remain for AI-Driven Software

Recent breakthroughs have brought the vision of AI-driven software development closer to reality. However, a new study from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and partner institutions urges a clear-eyed assessment of the current obstacles.

Titled “Challenges and Paths Towards AI for Software Engineering,” the paper explores software engineering tasks beyond code generation, pinpoints key bottlenecks, and suggests research priorities to enable automation of routine work—allowing developers to concentrate on higher-level design. The study is available on arXiv and will be presented at ICML 2025 in Vancouver.

“There’s a lot of talk about how programmers are becoming obsolete thanks to automation,” says senior author Armando Solar-Lezama, MIT professor and CSAIL principal investigator. “Yes, we’ve seen remarkable progress, but we’re still far from realizing the full potential of AI in software engineering.”

He notes that mainstream discussions often reduce the discipline to “intro-level programming: writing a small function from a spec or solving LeetCode-style problems,” ignoring the broader complexity of real-world software development.

The Demands of Modern Software Engineering

In reality, software engineering involves far more than basic coding. It spans routine tasks like cleaning up code and improving design, as well as massive efforts such as migrating millions of lines from COBOL to Java, changes that can transform entire organizations. It also demands continuous testing and analysis, using techniques like fuzzing and property-based testing to uncover race conditions or address critical vulnerabilities. Then there’s the day-to-day upkeep: documenting legacy code, summarizing changes for team members, and reviewing pull requests for code quality, efficiency, and security.
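The property-based testing the paper mentions checks that a property holds across many generated inputs, rather than against a few hand-picked cases. A minimal, library-free sketch (the function names here are hypothetical, and real tools such as Hypothesis generate inputs far more cleverly):

```python
import random

def buggy_merge(a, b):
    # Hypothetical function under test: merge two sorted lists.
    # (Bug: the set union silently drops duplicate elements.)
    return sorted(set(a) | set(b))

def check_merge_property(trials=1000):
    """Property: merging must preserve the total element count."""
    random.seed(0)  # reproducible input generation
    for _ in range(trials):
        a = sorted(random.choices(range(10), k=random.randint(0, 5)))
        b = sorted(random.choices(range(10), k=random.randint(0, 5)))
        merged = buggy_merge(a, b)
        if len(merged) != len(a) + len(b):
            return (a, b, merged)  # counterexample found
    return None

counterexample = check_merge_property()
```

Because the property quantifies over random inputs instead of fixed examples, even a subtle bug like dropped duplicates surfaces quickly; production frameworks then shrink the failing input to a minimal counterexample.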

Optimizing large-scale systems, like fine-tuning GPU kernels or refining the V8 engine in Chrome, remains difficult to measure. Most evaluation metrics today focus on small, isolated problems and on multiple-choice formats popular in natural language research, which were never a good fit for AI-for-code.

Current benchmarks, like SWE-Bench, ask models to fix GitHub issues, a step forward but still limited. These tasks typically involve only a few hundred lines of code and may pull from public repositories, risking data leakage. They also miss crucial real-world scenarios: AI-assisted refactoring, collaborative coding with humans, or optimizing massive codebases for performance. Until benchmarks evolve to reflect these high-impact use cases, tracking and advancing progress will remain a challenge.

Overcoming Communication Barriers Between Humans and AI

Another major hurdle is communication between humans and machines. Lead author Alex Gu, an MIT grad student in electrical engineering and computer science, describes today’s interaction as “a thin line of communication.” When he asks a model to generate code, it often produces a large, unstructured file and some superficial tests. What’s missing is deeper reasoning and effective use of core developer tools like debuggers and static analyzers that humans rely on for precision and insight.

“I don’t have much control over what the model produces,” Gu explains. Without a way for AI to signal its own confidence, such as flagging sections as reliable or uncertain, developers risk accepting code that appears functional but ultimately fails in real-world deployment. Equally important, he adds, is giving the model the awareness to ask for clarification when needed, rather than guessing.

These challenges grow even more complex at scale. Today’s AI models struggle with massive codebases, which can span millions of lines. While foundation models are trained on public GitHub data, Gu points out that “every company’s codebase is different,” with unique conventions, architectures, and requirements that fall outside the AI’s training distribution.

The result: code that looks correct but calls functions that don’t exist, ignores internal guidelines, or breaks CI pipelines. This so-called “hallucinated” code might compile, but it doesn’t conform to a team’s real-world patterns or practices.
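One narrow slice of this failure mode, calls to functions that simply don’t exist, can be caught statically. A toy lint pass using Python’s standard `ast` module sketches the idea (real analyzers also resolve imports, methods, and cross-module references; the snippet being checked is invented for illustration):

```python
import ast
import builtins

def undefined_calls(source):
    """Flag simple-name calls with no matching definition in the source.

    Toy check only: it knows about local `def`s and builtins, nothing else.
    """
    tree = ast.parse(source)
    defined = {node.name for node in ast.walk(tree)
               if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))}
    known = defined | set(dir(builtins))
    called = {node.func.id for node in ast.walk(tree)
              if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)}
    return sorted(called - known)

# Hypothetical AI-generated snippet: `parse_payload` is hallucinated.
snippet = """
def handler(payload):
    data = parse_payload(payload)
    return len(data)
"""
missing = undefined_calls(snippet)
```

A check like this catches only the crudest hallucinations; conformance to a team’s conventions and CI expectations is the much harder, still-open problem the paper points to.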

When AI Prioritizes Syntax Over Functionality

Retrieval-based approaches often miss the mark as well. AI systems tend to match based on syntax rather than deeper functionality, pulling in code with similar names instead of correct logic. “Standard retrieval techniques are easily misled by code that does the same thing but looks different,” says Solar-Lezama.
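The mismatch Solar-Lezama describes is easy to reproduce with toy functions (all names here are invented for illustration): two routines that share no surface syntax can be behaviorally identical, while a near-namesake does something else entirely.

```python
def sum_even(nums):
    """Sum the even values in a list."""
    return sum(n for n in nums if n % 2 == 0)

def accumulate_pairs(values):
    """Semantically identical to sum_even, syntactically dissimilar."""
    total = 0
    for v in values:
        if not v & 1:  # bitwise test for evenness
            total += v
    return total

def sum_even_indices(nums):
    """Syntactic lookalike (similar name), different behavior:
    sums values at even positions, not even values."""
    return sum(nums[i] for i in range(0, len(nums), 2))
```

A retriever keyed on names or token overlap would pair `sum_even` with `sum_even_indices` and miss `accumulate_pairs`, even though only the latter computes the same function, which is why the paper argues for retrieval grounded in behavior rather than surface form.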

Rather than proposing a single solution, the researchers advocate for a broader, community-driven effort. This includes building datasets that capture the real-world coding process like which edits developers keep or discard, how refactoring evolves over time and creating shared benchmarks to evaluate things like bug-fix durability, refactor quality, and migration accuracy. They also stress the importance of transparent tools that let models surface uncertainty and engage users in the loop, rather than assuming blind trust.

Collaborative Research to Evolve AI into a True Engineering Partner

Gu presents the paper as a “call to action” for broader open-source collaboration, an ambitious agenda that no single research lab could accomplish alone. Solar-Lezama envisions steady, incremental progress, where individual research breakthroughs tackle specific challenges and gradually enhance commercial tools, shifting AI from an autocomplete assistant to a true engineering collaborator.

“Why is this important?” Gu asks. “Software is the backbone of industries like finance, transportation, and healthcare, and maintaining it safely has become a growing bottleneck. If AI can take over the tedious, error-prone parts without introducing hidden failures, it frees engineers to focus on creativity, big-picture thinking, and ethical concerns.”

Still, he emphasizes, that vision requires recognizing that code completion is the easy part. “The real challenge is everything else,” Gu says. “We’re not trying to replace programmers—we want to empower them. When AI handles the repetitive and risky tasks, humans can concentrate on what only they can do.”

Baptiste Rozière, an AI scientist at Mistral AI who was not involved in the research, echoed the paper’s significance: “In a fast-moving field where it’s easy to chase trends, this work stands out for offering a clear overview of the key challenges in AI for software engineering and pointing to thoughtful, promising research directions.”


Read the original article on: Tech Xplore
