
How can an AI system be developed to closely replicate human speech? Nagoya University researchers have advanced this goal by creating J-Moshi, the first publicly available AI system designed to mimic Japanese conversational style.
J-Moshi Captures the Natural Flow of Japanese Conversation Through Aizuchi
J-Moshi replicates the natural rhythm of Japanese conversation, which is punctuated by brief interjections called aizuchi. Responses such as "Sou desu ne" ("That's right") and "Naruhodo" ("I see") signal active listening and occur far more often in Japanese than in English.
Conventional AI systems struggle to produce aizuchi because they cannot speak and listen at the same time, an ability essential to natural-sounding Japanese dialogue. J-Moshi can, which has made it popular among Japanese speakers who value its lifelike conversational style.
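For readers curious what "speaking while listening" looks like in practice, here is a minimal conceptual sketch of a full-duplex dialogue loop. It is an illustration only, not J-Moshi's actual architecture: the frame stream, the 30% backchannel chance, and the aizuchi list are all invented stand-ins for a learned model.

```python
import asyncio
import random

AIZUCHI = ["Sou desu ne", "Naruhodo", "Hai"]

async def user_speech():
    """Stand-in for a microphone stream yielding 200 ms audio frames."""
    for i in range(10):
        await asyncio.sleep(0.2)
        yield f"frame-{i}"

async def dialogue_loop():
    # In a true full-duplex model, every incoming frame updates the model
    # state while an output stream is generated in parallel, so the system
    # can interject without waiting for the user to finish speaking.
    async for frame in user_speech():
        if random.random() < 0.3:  # placeholder for a learned backchannel policy
            print(f"[system, still listening] {random.choice(AIZUCHI)}")
        else:
            print(f"[system] received {frame}")

asyncio.run(dialogue_loop())
```

The key point the sketch captures is that the listening loop never pauses while the system produces output, which is exactly what half-duplex, turn-by-turn assistants cannot do.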

Researchers at the Higashinaka Laboratory adapted Kyutai's English-language Moshi model to develop J-Moshi. The adaptation took about four months and involved training the AI on several Japanese speech datasets. Their findings are available on the arXiv preprint server.
The main training resource was J-CHAT, the largest publicly available Japanese dialogue dataset, containing 67,000 hours of audio. The team also drew on smaller, high-quality datasets, some developed in-house and others dating back 20 to 30 years, and expanded the training data further by converting written chat dialogues into synthetic speech with custom text-to-speech tools.
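The text-to-speech augmentation step might look roughly like the sketch below. The `synthesize` function, the dialogue format, and the file naming are hypothetical placeholders; the article does not describe the team's custom TTS tools in any detail.

```python
from pathlib import Path

def synthesize(text: str, speaker_id: int) -> bytes:
    """Placeholder TTS: a real system would return waveform audio."""
    return f"<speaker {speaker_id}> {text}".encode("utf-8")

def chat_to_audio(dialogue: list[tuple[int, str]], out_dir: Path) -> None:
    """Render each written utterance with a per-speaker voice and save it."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, (speaker, text) in enumerate(dialogue):
        audio = synthesize(text, speaker)
        (out_dir / f"utt_{i:04d}_spk{speaker}.wav").write_bytes(audio)

chat_to_audio(
    [(0, "Kyou wa ii tenki desu ne."), (1, "Sou desu ne!")],
    Path("synthetic_dialogues/demo"),
)
```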

In January 2025, J-Moshi attracted widespread attention after its demo videos went viral on social media. Beyond its technical novelty, the system aids language learning by helping non-native speakers practice natural Japanese conversation.
The research team is also investigating commercial uses for J-Moshi in areas such as call centers, healthcare, and customer service. Adapting the system to specialized fields is challenging because far less Japanese speech data is available than English.
Bridging Industry Expertise and Academic Innovation
Leading the team is Professor Ryuichiro Higashinaka, who brings a rare blend of industry and academic experience. Before joining Nagoya University five years ago, he spent 19 years at NTT developing dialogue systems like Shabette Concier.
In 2020, he founded his laboratory at Nagoya University to study human communication, bridging theory and practice with a roughly 20-member team that works on topics from conversation timing in Japanese dialogue to AI guide systems in public spaces.
“Technology like J-Moshi can enhance systems that work alongside human operators,” said Professor Higashinaka. “At Osaka’s NIFREL Aquarium, our guide robots handle routine tasks but transfer complex queries to human staff.” This project is part of Japan’s Cabinet Office Moonshot Initiative to improve service through AI-human collaboration.

Professor Higashinaka highlighted the distinct challenges in Japanese AI research: “Japan faces a shortage of speech data, which limits the ability to train AI dialogue systems. Privacy issues also need careful consideration.”
This lack of data has driven innovative approaches, such as using software to separate overlapping voices in podcast recordings into individual speaker tracks for training purposes.
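As a rough illustration of that approach, the sketch below splits a mixed recording into per-speaker tracks. The article does not name the separation software, so `separation_model` here is a stub standing in for a pretrained source-separation network.

```python
import numpy as np

def separation_model(mix: np.ndarray) -> list[np.ndarray]:
    """Stub: a real pretrained network would return one waveform per voice."""
    return [mix * 0.5, mix * 0.5]  # fake two-speaker split

def podcast_to_tracks(mix: np.ndarray) -> list[np.ndarray]:
    # Each separated track can then be treated as a single-speaker
    # recording when building dialogue training data.
    return [np.clip(s, -1.0, 1.0) for s in separation_model(mix)]

mix = np.random.uniform(-0.1, 0.1, size=16000)  # one second of fake 16 kHz audio
tracks = podcast_to_tracks(mix)
print(f"separated {len(tracks)} speaker tracks")
```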
Challenges of Dialogue Systems in Complex Social and Visual Environments
Currently, dialogue systems struggle with complex social contexts, especially when they must account for interpersonal relationships and physical surroundings. Visual barriers like masks or hats can also reduce effectiveness by hiding important facial cues. Testing at Osaka’s NIFREL Aquarium showed the AI sometimes needs human help to answer questions.
J-Moshi is a breakthrough in natural Japanese conversation, but such systems still rely on human support. The team is addressing this by developing tools that summarize dialogues and detect problems, so that human operators can step in at the right moment.
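A toy version of such an intervention check might look like the following; the threshold and trigger phrases are invented for illustration, since the article does not detail the team's actual tools.

```python
LOW_CONFIDENCE = 0.4  # assumed speech-recognition confidence threshold
TROUBLE_PHRASES = ["wakarimasen", "i don't understand", "operator please"]

def needs_operator(transcript: list[str], asr_confidence: float) -> bool:
    """Flag a dialogue for human intervention (heuristics are illustrative)."""
    if asr_confidence < LOW_CONFIDENCE:
        return True
    return any(p in turn.lower() for turn in transcript for p in TROUBLE_PHRASES)

print(needs_operator(["Sumimasen, wakarimasen..."], asr_confidence=0.9))  # True
```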
Beyond J-Moshi, the lab’s research covers broader human-robot interaction techniques. Working alongside colleagues specializing in lifelike humanoid robots, they are creating systems that synchronize speech, gestures, and movements to enable more natural communication.
Advancing AI in Robotics
Robots like those from Unitree Robotics showcase how AI advances can combine conversation with a physical presence. The team regularly demonstrates its work at university events open to the public.
Their research paper on J-Moshi has been accepted for presentation at Interspeech, the world's largest international conference on speech technology and research. Professor Higashinaka and his team will present their findings in Rotterdam, the Netherlands, in August 2025.
“In the near future, we will see systems that collaborate effortlessly with humans through natural speech and gestures. My goal is to develop the core technologies that will drive this transformative future,” Professor Higashinaka stated.
Read the original article on: Techxplore
Read more: Monday Anxiety is Real and New Research Provides Biological Evidence to Back it up
