This “Smart Coach” Guides LLMs in Transitioning Between Text and Code

Large language models (LLMs) are skilled at using textual reasoning to interpret documents and provide logical responses. However, they often stumble on basic math problems.
Image Credits: Christine Daniloff, MIT; iStock

This is because textual reasoning isn't well suited to computational or algorithmic tasks. While some LLMs can generate code, such as Python, for symbolic problems, they frequently struggle to determine when to use code or what kind would be most effective.

To help with this, MIT researchers developed CodeSteer, a “smart assistant” that helps LLMs decide when to switch between text and code generation to solve a given problem.

Iterative Prompting for Smarter Solutions

CodeSteer, a smaller LLM itself, creates a sequence of prompts to guide a larger model. It reviews the model’s current and past answers after each step and suggests improvements until it reaches the correct solution.

In testing, CodeSteer improved accuracy on symbolic tasks like multiplication, Sudoku, and block stacking by over 30%. It also helped less capable models outperform more advanced ones by boosting their reasoning abilities.

This development could enhance LLMs' ability to tackle complex problems that go beyond the limits of textual reasoning, such as plotting navigation paths for robots in unpredictable settings or coordinating logistics in global supply chains.

“While there’s a race to build all-in-one models, we’re taking a different path,” says Chuchu Fan, associate professor of aeronautics and astronautics at MIT and lead investigator at the Laboratory for Information and Decision Systems (LIDS). “Researchers have long developed effective tools for solving specific problems. Our goal is to help LLMs identify and apply the right tools and draw on existing expertise to expand their capabilities.”

Fan, the study’s senior author, collaborated on the research with LIDS graduate student Yongchao Chen; AeroAstro graduate student Yilun Hao; Yueying Liu from the University of Illinois at Urbana-Champaign; and Yang Zhang, a research scientist at the MIT-IBM Watson AI Lab. The findings will be presented at the International Conference on Machine Learning.

An LLM “Coach”

If you ask an LLM which number is larger, 9.11 or 9.9, it often gets it wrong by relying on textual reasoning. But when prompted to use code, it can generate and run a simple Python script to compare the numbers correctly.
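A minimal sketch of the kind of script an LLM might produce for this comparison (an illustration, not an actual model output):

```python
# Comparing the values numerically avoids the textual trap of
# assuming "9.11" is larger just because 11 > 9.
a, b = 9.11, 9.9
print(max(a, b))  # 9.9
```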

Developers originally train LLMs to understand and respond in human language, so the models tend to favor text-based answers even when a coding solution would be more accurate. Though fine-tuning has taught some LLMs to write code, they still frequently produce incorrect or inefficient versions.

Instead of retraining a large model like GPT-4 or Claude, MIT researchers chose to fine-tune a smaller, lightweight LLM to guide the larger one in deciding when to switch between text and code. This approach doesn’t modify the core of the bigger model, preserving its original capabilities.

“We took inspiration from how human coaching works,” says Chen. “A trainer may not be more skilled than the star athlete, but they can still provide valuable guidance. The same idea applies to LLMs.”

CodeSteer: A Smart Trainer for Guiding LLM Responses

This “trainer,” called CodeSteer, works alongside the main LLM. It reviews the query, decides whether a text or code response is more appropriate, and selects the best type of code to use.

CodeSteer then creates a prompt for the larger LLM, instructing it to use either code or textual reasoning to respond to the query. The larger model generates an answer based on this prompt and returns it to CodeSteer for evaluation.

If the result is incorrect, CodeSteer continues guiding the model, suggesting alternative strategies like adding a search algorithm or constraints to the Python code until it arrives at the correct answer.
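The iterative loop described above can be sketched roughly as follows. All of the names here (`steer_model`, `large_model`, `check_answer`, and the toy stand-ins) are hypothetical placeholders for illustration, not the authors' actual interface:

```python
def codesteer_loop(query, steer_model, large_model, check_answer, max_rounds=5):
    """A hypothetical sketch of the guidance loop: the small steering model
    picks a strategy ("use text" vs. "use code"), the large model answers,
    and the loop repeats with revised guidance until the answer checks out."""
    history = []
    for _ in range(max_rounds):
        # The steering model sees the query plus all previous attempts.
        guidance = steer_model(query, history)
        answer = large_model(query, guidance)
        history.append((guidance, answer))
        if check_answer(query, answer):
            return answer
    return history[-1][1]  # best effort after max_rounds

# Toy stand-ins for demonstration only:
def toy_steer(query, history):
    return "use text" if not history else "use code"

def toy_large(query, guidance):
    return "9.11" if guidance == "use text" else "9.9"

def toy_check(query, answer):
    return answer == "9.9"

print(codesteer_loop("Which is larger, 9.11 or 9.9?",
                     toy_steer, toy_large, toy_check))  # 9.9
```

In this toy run, the first (text-based) attempt fails the check, so the steering model switches to a code-based prompt on the next round, mirroring the switching behavior the researchers describe.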

“Often, we found that the larger LLM tends to take shortcuts, producing overly simple or inefficient code that fails to handle the symbolic reasoning properly,” Chen explains. “CodeSteer is designed to counteract this behavior.”

To support this, a symbolic checker analyzes the code’s complexity and alerts CodeSteer if the solution is too basic or inefficient. Additionally, CodeSteer includes a self-answer checker that prompts the LLM to generate code that verifies the answer’s accuracy.
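One minimal way such a complexity check could work, purely as an illustration (the researchers' actual checker is not described in detail here), is to flag generated code whose abstract syntax tree is suspiciously small, a sign it may just print a hard-coded answer:

```python
import ast

def looks_too_simple(code, min_nodes=10):
    """Flag code whose AST has very few nodes, e.g. a bare print of
    a constant, rather than an actual computation. The threshold is
    an arbitrary illustrative choice."""
    tree = ast.parse(code)
    return sum(1 for _ in ast.walk(tree)) < min_nodes

print(looks_too_simple("print(42)"))  # True: a trivial shortcut
print(looks_too_simple("total = sum(i*i for i in range(10))\nprint(total)"))  # False
```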

Solving Challenging Problems

While developing CodeSteer, the researchers found that existing symbolic datasets weren’t suitable for fine-tuning or evaluation—most benchmarks didn’t specify whether a query should be solved using text or code.

To address this gap, they created their own dataset, SymBench, consisting of 37 complex symbolic tasks across domains like spatial reasoning, mathematics, ordering, and optimization. They then fine-tuned CodeSteer using this dataset to enhance its performance.

In testing, CodeSteer outperformed all nine baseline methods and raised average accuracy from 53.3% to 86.4%. It also showed strong generalization, performing well on unseen tasks and across various LLMs.

Additionally, general-purpose models paired with CodeSteer surpassed the accuracy of specialized state-of-the-art models designed for advanced reasoning and planning—while using significantly less computational power.

“Our method builds on the LLM’s own strengths,” says Chen. “By teaching a model when and how to use code effectively, we can boost the performance of even already-powerful systems.”

Enhancing Efficiency and Unifying Reasoning with Code Generation

Looking ahead, the researchers aim to make CodeSteer more efficient by speeding up its iterative prompting process. They're also exploring how to fine-tune a single, unified model capable of switching between textual reasoning and code generation, eliminating the need for a separate assistant.

“This is a smart and elegant approach to a major challenge in LLM tool usage,” says Jinsung Yoon, a staff research scientist at Google Cloud AI who was not involved in the project. “It offers a simple yet effective way to boost LLM performance without retraining the models directly. This work marks a meaningful advancement that could broaden the real-world use of LLMs, especially on tasks where they currently fall short.”

Chi Wang, a senior staff scientist at Google DeepMind, also praised the work: “Training a smaller, specialized model to guide more advanced ones is a powerful idea. This kind of collaboration among AI ‘agents’ opens the door to stronger, more flexible systems for solving complex, real-world problems.”

The project received support from the U.S. Office of Naval Research and the MIT-IBM Watson AI Lab.


Read the original article on: MIT News (Massachusetts Institute of Technology)