![DeepSeek Jailbreak](https://securityexpress.info/wp-content/uploads/2025/02/Bad-Likert-Judge-responses-after-using-additional-prompts-1024x573.png)
Bad Likert Judge responses after using additional prompts
Researchers from Palo Alto Networks’ Unit 42 team have identified vulnerabilities in the DeepSeek language model, enabling the circumvention of its safeguards and the generation of prohibited content. By employing three jailbreak techniques—Deceptive Delight, Bad Likert Judge, and Crescendo—they achieved high success rates in bypassing restrictions without requiring advanced technical expertise.
DeepSeek, a Chinese company, has developed two major open-source language models: DeepSeek-V3, released in December 2024, and DeepSeek-R1, launched in January 2025. These models are emerging as formidable competitors to popular large language models (LLMs) and are undergoing rapid advancement. However, Unit 42’s research demonstrates that even the most sophisticated iteration remains susceptible to manipulation, allowing for the creation of potentially dangerous content.
The Bad Likert Judge technique exploits the model’s response evaluation system, where it assesses content based on perceived harmfulness before providing detailed examples. Using this method, researchers successfully extracted instructions for developing data theft tools and keyloggers. Despite initial refusals from the model, iterative refinements of queries enabled them to circumvent safeguards and obtain comprehensive algorithms for malware development.
Crescendo operates by gradually escalating the complexity of a query. Initially, the model responds to general questions, but through successive iterations, it begins to generate instructions for restricted activities. In Unit 42’s testing, this method yielded step-by-step guidelines for crafting Molotov cocktails, as well as materials related to violence, illicit substance distribution, and psychological manipulation.
Deceptive Delight embeds malicious content within a seemingly innocuous narrative. For instance, researchers instructed the model to generate a story linking a cybersecurity competition, a prestigious university, and the use of DCOM for remote command execution. In response, DeepSeek produced a code snippet that could be leveraged to attack Windows-based systems.
The experiments revealed that DeepSeek is not only vulnerable to such exploitations but can also provide comprehensive guidance on hacking techniques, social engineering, and other malicious activities. In some cases, the model even included recommendations on how to obfuscate attacks and evade detection mechanisms.
Experts warn that vulnerabilities in AI-driven models could facilitate the widespread dissemination of offensive cyber tools among malicious actors. While LLM developers continuously integrate security measures, the evolving sophistication of jailbreak techniques makes this a relentless cat-and-mouse game. Organizations utilizing such models must rigorously oversee their deployment and implement robust query monitoring mechanisms.
Unit 42 recommends leveraging specialized tools designed to detect and mitigate attempts to bypass AI safeguards, thereby minimizing risks associated with the exploitation of language model vulnerabilities.