
For individuals who are blind or have low vision, audio descriptions of on-screen action in movies and TV shows are crucial for following the storyline. While networks and streaming platforms typically employ professionals to produce these descriptions, the vast majority of videos on platforms like YouTube and TikTok lack such accessibility features.
That doesn’t mean people aren’t interested in the content.
AI-Powered Crowdsourcing Platform Lets Blind Users Request and Rate Video Descriptions
To help make user-generated videos more accessible, researchers at Northeastern University are using AI vision-language models (VLMs) to generate audio descriptions through a crowdsourced platform called YouDescribe. The platform works much like a library: blind and low-vision users can request descriptions for specific videos, rate the ones they receive, and even contribute their own.
“It makes sense that a short 20-second TikTok video of someone dancing might not have a professional description,” says Lana Do, who earned her master’s degree in computer science from Northeastern’s Silicon Valley campus in May. “But blind and low-vision users may still want to experience that dancing video.”
One example: BTS’s 2020 music video for “Dynamite” tops YouDescribe’s wishlist, still waiting to be described. Although the platform has around 3,000 volunteer describers, demand far outstrips their capacity: only 7% of the videos on the wishlist have been described so far, according to Do.
Do conducts her research in the lab of Ilmi Yoon, a teaching professor of computer science at Northeastern’s Silicon Valley campus. Yoon became part of the YouDescribe team in 2018 to help integrate machine learning into the platform.
New Tools Boost YouDescribe’s Accuracy and Accessibility
This year, Do introduced several enhancements to make YouDescribe’s human-in-the-loop process more efficient. Newer VLMs now produce higher-quality draft descriptions, a new “infobot” feature lets users ask questions about specific video frames, and a collaborative editing interface allows low-vision users to correct inaccuracies in the descriptions, Do explains.
The goal is to make video descriptions more accurate and more widely available. AI-generated drafts cut the workload for human describers, while users pitch in by rating descriptions and offering feedback, she adds.
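The article doesn’t say which models power this drafting step, but it can be pictured with an off-the-shelf captioning model. Below is a minimal sketch in Python, using the open-source BLIP model from Hugging Face’s transformers library purely as a stand-in for YouDescribe’s unnamed VLM; the frame path and caption length are illustrative assumptions.

    # Illustrative sketch only: draft a description for one extracted video frame.
    # BLIP stands in here for whatever VLM YouDescribe actually uses.
    from transformers import BlipProcessor, BlipForConditionalGeneration
    from PIL import Image

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    # "frame.jpg" is a placeholder for a frame pulled out of the video.
    frame = Image.open("frame.jpg").convert("RGB")

    inputs = processor(images=frame, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)

    draft = processor.decode(output_ids[0], skip_special_tokens=True)
    print(draft)  # e.g. "a person dancing in a living room"

A human describer would then edit and time a draft like this against the video, which is where the rating and feedback loop comes in.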
The infobot is aimed at exactly that kind of moment. “For example, someone might watch a forest documentary, hear an undescribed flapping sound, and wonder what caused it,” she says.
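How the infobot answers such questions isn’t detailed in the article. As a rough illustration, a visual question-answering model can field frame-level questions; the sketch below assumes the open BLIP VQA checkpoint purely as a stand-in. Note that such a model reasons over pixels only, so the viewer’s sound-based question has to be recast as a visual one.

    # Illustrative sketch only: answer a viewer's question about one frame.
    # The article does not say what model powers YouDescribe's infobot.
    from transformers import BlipProcessor, BlipForQuestionAnswering
    from PIL import Image

    processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
    model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

    frame = Image.open("forest_frame.jpg").convert("RGB")  # placeholder frame
    question = "What animal is in this frame?"  # visual rephrasing of the query

    inputs = processor(images=frame, text=question, return_tensors="pt")
    output_ids = model.generate(**inputs)
    print(processor.decode(output_ids[0], skip_special_tokens=True))  # e.g. "bird"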
Showcasing AI’s Potential
Do and her team recently presented a paper at the Symposium on Human-Computer Interaction for Work in Amsterdam, highlighting how AI can help speed up the creation of audio descriptions. According to Yoon, AI performs surprisingly well at describing human gestures and facial expressions. In one demo, an AI describes a chef’s step-by-step process for making cheese rolls.
However, challenges remain. Yoon notes that AI struggles to interpret the facial expressions of animated characters and often misses the most crucial elements in a scene, an area where humans still excel, particularly because a good description depends on clarity and relevance.
“It’s a very labor-intensive task,” Yoon says.
Graduate students in her lab analyze how AI-generated drafts compare to descriptions written by humans.
“We identify the gaps and use that data to improve the AI’s performance,” she explains. “Blind users don’t want overwhelming or unnecessary narration. Creating a good description is really an editorial skill—capturing what matters most in a clear and concise way.”
The San Francisco-based Smith-Kettlewell Eye Research Institute launched YouDescribe in 2013 to train sighted volunteers to produce audio descriptions. The platform focuses on making YouTube and TikTok videos accessible and offers tutorials on how to record and synchronize narration for user-generated content.
Read the original article on: Tech Xplore
