Leaked Documents Reveal China’s AI-Driven Censorship System

A complaint about rural poverty, a news report on a corrupt Communist Party official, or a plea for help against extortion by police—these are just a few of the 133,000 examples used to train a powerful AI system designed to automatically detect content deemed sensitive by the Chinese government.

Leaked Data Exposes China’s Expanding AI-Powered Censorship System

A leaked database obtained by TechCrunch reveals that China has developed an AI-driven censorship system that enhances its already extensive monitoring capabilities, extending far beyond traditional red lines like the Tiananmen Square massacre.

While primarily aimed at controlling online discourse within China, the system could also be used to further refine censorship in Chinese AI models.

This photo, taken on June 4, 2019, shows the Chinese flag behind razor wire at a housing compound in Yengisar, south of Kashgar, in China’s western Xinjiang region. Image Credits: Greg Baker / AFP / Getty Images

Leaked Dataset Shows China Using AI to Strengthen Repression, Expert Says

Xiao Qiang, a researcher at UC Berkeley specializing in Chinese censorship, told TechCrunch that the leaked dataset provides “clear evidence” that the Chinese government or its affiliates aim to use large language models (LLMs) to enhance repression.

“Unlike traditional censorship methods, which depend on human labor for keyword filtering and manual review, an LLM trained on such data would drastically improve the efficiency and precision of state-led information control,” Qiang explained.

This aligns with growing evidence that authoritarian regimes are rapidly adopting advanced AI technologies. In February, for instance, OpenAI reported that multiple Chinese entities had used LLMs to monitor anti-government posts and discredit dissidents.

In response, the Chinese Embassy in Washington, D.C., told TechCrunch that it opposes “groundless attacks and slanders against China” and emphasized its commitment to ethical AI development.

Security researcher NetAskari discovered the dataset and shared a sample with TechCrunch after finding it in an unsecured Elasticsearch database hosted on a Baidu server.

This does not suggest any direct involvement from either company, as various organizations use these providers for data storage.

The exact creator of the dataset remains unknown, but records show it is current, with the latest entries dating to December 2024.

An unnamed LLM is tasked with flagging politically, socially, or militarily sensitive content as “highest priority.”

AI Censorship Targets Pollution, Fraud, Labor Disputes, and Political Satire

Key targets include pollution scandals, financial fraud, labor disputes, and political satire—especially historical analogies about current leaders or mentions of “Taiwan politics.” Military reports on troop movements and weaponry are also closely monitored.

Snippets from the dataset reference prompt tokens and LLMs, confirming that the system is AI-driven.


TechCrunch analyzed 10 samples from the 133,000 examples flagged for censorship.

Many address sensitive issues, such as police corruption, rural poverty, and a CCP official expelled for holding “superstitious” beliefs instead of Marxism.

Taiwan and military topics are heavily monitored, with “Taiwan” appearing over 15,000 times in the dataset.

Even subtle dissent is flagged, including an idiom about power’s fleeting nature—an especially sensitive theme in China’s authoritarian system.

The dataset lacks details about its creators but states it is intended for “public opinion work,” a strong indicator of its alignment with Chinese government objectives, an expert told TechCrunch.

China’s CAC Uses AI to Strengthen Censorship Under ‘Public Opinion Work’

Michael Caster, Asia program manager at rights group Article 19, noted that “public opinion work” falls under the Cyberspace Administration of China (CAC), which oversees censorship and propaganda.

The ultimate goal is to safeguard Chinese government narratives online while eliminating dissent. President Xi Jinping has even called the internet the “frontline” of the CCP’s “public opinion work.”

TechCrunch’s analysis of the dataset adds to growing evidence that authoritarian regimes are harnessing AI for repression.

Last month, OpenAI reported that an unidentified entity, likely based in China, used generative AI to track social media discussions—especially those supporting human rights protests against China—and relay the information to authorities.

OpenAI also discovered AI being used to generate critical comments about prominent Chinese dissident Cai Xia.

China’s censorship traditionally relies on basic algorithms that block blacklisted terms like “Tiananmen massacre” or “Xi Jinping,” as many users noticed when testing DeepSeek.

However, LLMs can enhance censorship by detecting subtle criticism on a massive scale. Some AI models can even refine their capabilities as they process more data.

“This shift toward AI-driven censorship is making state control over public discourse more sophisticated, especially as Chinese models like DeepSeek gain traction,” Xiao, the Berkeley researcher, told TechCrunch.


Read the original article on: TechCrunch

Read more: China’s DeepSeek Shakes Up the AI Industry, Becoming a Trillion-Dollar Game-Changer Overnight
