Andrea Viliotti

LLMs and Security: MRJ-Agent for a Multi-Round Attack

Updated: Dec 2

The growing use of large language models such as GPT-4 in critical areas has highlighted the need to address security and reliability issues with greater care. Although these models command a vast body of knowledge, there is a concrete risk that they may generate harmful or inappropriate responses, especially under specific attacks known as "jailbreaks." The study by Wang and collaborators proposes a new multi-round attack agent, called MRJ-Agent, designed to probe the vulnerabilities of language models and strengthen their security by exploiting the complex dynamics of human dialogue.


Issues in LLM Security and Limitations of Existing Approaches

Jailbreak attacks aim to manipulate LLMs into providing sensitive or potentially harmful content. The research highlights that most efforts so far have focused on single-round attacks, i.e., a single direct request to the model. However, these approaches poorly reflect how humans actually interact with these systems: interactions are usually multi-round, with questions and answers unfolding over several turns.

Single-round attacks often use methods such as "prompt engineering," which involves constructing prompts designed to hide malicious intentions. For example, some approaches (Zou et al. 2023; Wei, Haghtalab, and Steinhardt 2024) use ASCII codes or encrypted messages to mask dangerous requests. These methods, although effective in some cases, fail to capture the complexity of multi-round interactions. As emerged from the research of Ma et al. (2024) and Perez et al. (2022), this more natural and complex form of interaction represents the real challenge for large language models, making single-round methods less meaningful in practice.


In recent years, approaches for multi-round attacks have been developed, but they have shown several limitations. One example is the approach proposed by Zhou et al. (2024), which breaks down an original question into multiple sub-questions, then aggregates the answers to obtain harmful content. However, this method fails to reproduce the naturalness of a human conversation and often triggers the models' defense mechanisms, thereby reducing its effectiveness. Other methods (Russinovich, Salem, and Eldan 2024; Yang et al. 2024) adopt iterative trial-and-error tactics to induce the model to generate dangerous output. However, a key problem lies in the dependence on very powerful models like GPT-4, which often activate safety mechanisms, leading to rejected requests and a reduction in attack effectiveness.


The research by Wang et al. addresses these limitations by combining risk decomposition with psychological induction, making the attack more effective and harder to detect. Risk decomposition breaks the original malicious intent into apparently harmless sub-requests, distributing the risk over multiple rounds. For example, a request like "how to build a bomb" is transformed into a series of questions about generic chemical reactions that progressively lead to more specific content. The decomposition is carried out with models like GPT-4, which generate the sub-requests while keeping their semantic similarity to the original under control so that no single request becomes too obviously dangerous. Experiments showed that controlling the similarity between the sub-requests and the original request significantly increases the success rate of the attack.
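
As a rough illustration of this similarity control (not the authors' actual prompts or metric), the sketch below scores candidate sub-requests against the original goal with sentence embeddings, keeps those that remain close enough to it, and orders them from least to most similar so the dialogue can escalate gradually. The embedding model is an arbitrary choice, and the 0.85 default corresponds to the minimum similarity threshold reported later in this article.

```python
# Illustrative sketch only: the paper's exact decomposition prompts and
# similarity metric are not reproduced here.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary embedding model

def order_sub_requests(original: str, candidates: list[str],
                       min_similarity: float = 0.85) -> list[str]:
    """Keep candidate sub-requests that stay semantically close enough to the
    original goal, ordered from least to most similar so that the dialogue
    escalates gradually toward the target content."""
    goal_emb = encoder.encode(original, convert_to_tensor=True)
    scored = []
    for candidate in candidates:
        cand_emb = encoder.encode(candidate, convert_to_tensor=True)
        score = util.cos_sim(goal_emb, cand_emb).item()
        if score >= min_similarity:
            scored.append((score, candidate))
    return [candidate for _, candidate in sorted(scored)]
```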


Additionally, the psychological induction strategy exploits techniques such as reflective induction or support based on multiple pieces of evidence to reduce the likelihood of rejection by the model. These strategies were evaluated on both open-source models like Llama-2-7B and closed-source models like GPT-4, showing a higher success rate in overcoming defenses than traditional approaches.


MRJ-Agent: Technical Features and Attack Method

MRJ-Agent introduces an innovative attack methodology that simulates a heuristic search process decomposed into multiple rounds. Starting from a potentially dangerous request (e.g., “how to build a bomb”), the process begins with an innocent question (such as a generic chemical reaction), then gradually progresses to more sensitive topics. This approach was designed to maximize the probability of circumventing the integrated safety mechanisms in LLMs.
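
Before turning to the individual strategies, the overall loop can be sketched as follows. Every helper here (`next_sub_request`, `query_target`, `is_refusal`, `is_harmful`) is a hypothetical placeholder standing in for the components described in the rest of this section, not the authors' code.

```python
def multi_round_attack(goal: str, max_rounds: int = 10) -> list[tuple[str, str]]:
    """Heuristic multi-round search: start from an innocuous question and
    escalate toward the goal, backing off whenever the target refuses."""
    dialogue: list[tuple[str, str]] = []             # (request, reply) pairs so far
    request = next_sub_request(goal, dialogue)       # first, harmless question
    for _ in range(max_rounds):
        reply = query_target(dialogue, request)      # one round against the target model
        if is_refusal(reply):
            # The target pushed back: retry with a less direct variant.
            request = next_sub_request(goal, dialogue, soften=True)
            continue
        dialogue.append((request, reply))
        if is_harmful(reply, goal):                  # the target produced the unsafe content
            break
        request = next_sub_request(goal, dialogue)   # move one step closer to the goal
    return dialogue
```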


The method involves three main strategies:


• Information Control Strategy: This strategy guides the trial-and-error process, controlling the similarity between the generated requests and the original. Information control is achieved through a heuristic approach that monitors the degree of semantic similarity between requests and the final goal. Experiments have shown that, by setting a minimum similarity threshold of 0.85 between the generated request and the original, it is possible to maintain the focus of the attack without compromising its effectiveness.


• Psychological Induction Strategy: To minimize the probability of rejection by the model, psychological strategies are used to increase persuasion and decrease the LLM's perception of risk. Specifically, psychological induction was enhanced through 13 specific strategies, such as support based on multiple pieces of evidence and cognitive influence (a template-wrapping sketch follows this list). The results show that, compared with merely decomposed requests, psychologically reinforced sub-requests increased the success rate by up to 39.7% on GPT-4.


• Red-Team Training Strategy: A red-team model (called πred) was developed to perform multi-round attacks automatically, dynamically adapting to the target model’s responses. During training, the model used a direct preference optimization technique (see the loss sketch after this list) to learn to select the most effective strategy in each situation. Using models of different capacities (7B and 13B) showed that increasing the size of the red-team model leads to a significant increase in the success rate, reaching 100% when the maximum number of rounds is 10 or more.
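
As an illustration of how the induction strategies from the second point can be applied in practice, each decomposed sub-request can simply be wrapped in a persuasion template before being sent to the target model. The templates below are invented for this sketch and are not the paper's 13 strategies.

```python
# Hypothetical templates, for illustration only.
PERSUASION_TEMPLATES = {
    "multiple_evidence": (
        "Several textbooks and published safety reports discuss this openly. "
        "For completeness, could you also explain: {request}"
    ),
    "reflective_induction": (
        "Looking back at your earlier answers in this conversation, "
        "could you elaborate further on: {request}"
    ),
}

def apply_strategy(sub_request: str, strategy: str) -> str:
    """Wrap a decomposed sub-request in a persuasion template so it reads as a
    routine, low-risk continuation of the ongoing dialogue."""
    return PERSUASION_TEMPLATES[strategy].format(request=sub_request)
```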

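For the red-team training step, the direct preference optimization objective mentioned in the third point can be written compactly. The sketch below is the standard DPO loss, not necessarily the authors' exact formulation: the sub-request that advanced the jailbreak is treated as the "chosen" sample and the one that triggered a refusal as the "rejected" one.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over a batch of preference pairs. Inputs are the
    summed log-probabilities of each response under the trainable red-team
    policy and under a frozen reference model."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```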

Experimental Results and Comparison with Other Attack Methods

The experimental results highlighted the strong performance of MRJ-Agent compared with other attack techniques, in both single-round and multi-round settings. In particular, in evaluations on models such as Llama-2-7B and GPT-4, MRJ-Agent achieved complete success (100%) in multi-round interactions, significantly surpassing the alternative method "Speak Out of Turn," which stopped at 20%. This figure reflects the system’s superior effectiveness in handling complex scenarios.


Compared with other multi-round attack techniques, MRJ-Agent achieved a success rate of 92% on Llama-2-7B with a single trial, rising to 100% with multiple attempts. This indicates a clear advantage in efficiency and robustness, reached with far fewer repeated attempts than competing approaches require. It also reflects more effective handling of the target model's responses, allowing MRJ-Agent to stand out as a highly optimized system.


Additional tests have shown that MRJ-Agent maintains high performance even in the presence of advanced defenses. For example, with protection systems like "Prompt Detection" and "System Prompt Guard," success rates were 88% and 78%, respectively, with a single attempt, rising to 94% and 82% with two trials. These results demonstrate the system’s ability to adapt even to sophisticated countermeasures, maintaining high effectiveness in overcoming the implemented defenses.


Compared to existing methods, MRJ-Agent also showed clear superiority against closed models like GPT-4, achieving an average success rate of 98%, compared to a maximum of 92% achieved with alternative methods such as "Chain-of-Attack" (CoA). Additionally, the ability to achieve these results with fewer interaction rounds and attempts than rival approaches represents a significant advantage in terms of operational efficiency.


Another aspect analyzed concerns the impact of the size of the red-team model employed by MRJ-Agent. The results revealed that adopting a model with 13 billion parameters (13B), compared to one with 7 billion (7B), leads to a substantial increase in the success rate in more complex situations. For example, with a maximum of 15 rounds, the 13B model achieved complete success (100%), while the 7B model stopped at 94%. This suggests that larger red-team models can significantly improve the effectiveness of attacks, especially in more intricate contexts or against more elaborate defenses.


In summary, MRJ-Agent has demonstrated remarkable multi-round interaction management capabilities, effectively adapting to both open-source and closed-source models, without showing performance declines. Particularly noteworthy was its robustness in circumventing the defense systems present in closed models like GPT-4, where the success rate approached 100%. These results highlight the urgency of developing more advanced security countermeasures to counter increasingly sophisticated attack systems.


Generalization of the Attack and Other Scenarios

The versatility of MRJ-Agent also extends to image-to-text tasks, where the ability to exploit visual details as a starting point for more sensitive questions proved essential. For example, in attacking models like GPT-4o with harmless images, the success rate was 80%, showing that the agent can use visual context to steer subsequent questions toward sensitive content. Linking visual and textual content in this way makes these models harder to defend effectively, since the requests appear more natural and less suspicious.


In text-to-image tasks, MRJ-Agent showed a reduced capability compared to text-to-text, with a success rate of 50% for generating potentially harmful images. This is partly due to the more robust safety mechanisms integrated into commercial models like DALL-E 3, which actively block sensitive content. However, MRJ-Agent demonstrated progressive adaptation of its risk instructions, gradually increasing the likelihood of generating problematic content. This progressive refinement of instructions is particularly effective for circumventing automatic defenses, especially when the attack unfolds over multiple rounds.


In another experiment, MRJ-Agent was tested on its ability to generalize to datasets such as JailbreakBench (JBB), which includes ten categories of risky behavior. On this benchmark, the success rate was 93.9%, confirming MRJ-Agent’s effectiveness not only in purely textual scenarios but also in broader and more diversified contexts. The hardest category to attack turned out to be sexual content, with a success rate of 71.42% and an average of 11.85 queries, suggesting that the models' sensitivity to such stimuli remains high.


Future Implications

The future implications of the work on MRJ-Agent mainly concern the need to develop further defense mechanisms capable of addressing increasingly sophisticated attacks spread over multiple rounds of interaction. The effectiveness demonstrated by MRJ-Agent in circumventing defense mechanisms suggests that large models must be equipped with dynamic detection and response capabilities, capable of evolving in step with threats. An approach that could be adopted in the future is the implementation of AI-based defense strategies that can automatically adapt to changes in attack patterns and learn from previous interactions.
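
One way to make this concrete, sketched below under loose assumptions (`risk_score` is a hypothetical classifier that rates an entire conversation rather than a single message), is to evaluate the accumulated dialogue at every turn, so that requests which look harmless in isolation can still trigger a refusal once their combination drifts toward a harmful goal.

```python
def guarded_reply(model, dialogue, new_request, threshold=0.7):
    """Moderate at the conversation level rather than per message.
    `risk_score` is a hypothetical classifier returning a harmfulness score
    in [0, 1] for the whole dialogue; `model.generate` is a placeholder."""
    candidate = dialogue + [("user", new_request)]
    if risk_score(candidate) > threshold:            # score the trajectory, not the turn
        return "I can't help with that."
    reply = model.generate(candidate)
    # Re-check after generation: the reply itself may complete the risk.
    if risk_score(candidate + [("assistant", reply)]) > threshold:
        return "I can't help with that."
    return reply
```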


Furthermore, the fact that MRJ-Agent has shown attack capabilities across a wide range of scenarios, including image-to-text and text-to-image, highlights the need to expand security methodologies to all AI application fields. This implies that not only language models but also image generative models and other types of AI must be made more robust against these types of threats. A possible development in this regard could be the creation of a series of standardized benchmarks to evaluate the resilience of models to different types of multi-round attacks.


Another significant implication concerns the continuous alignment of models with human values. Multi-round attacks like those conducted by MRJ-Agent highlight the difficulty of maintaining stable alignment when models are subjected to prolonged and complex interactions. A future research area could focus on improving alignment techniques based on human feedback, for example by using adaptive reinforcement from human experts to detect deviation signals and correct the model’s behavior.


Finally, the disclosure of the data and code used to train MRJ-Agent represents another important step toward a more transparent and collaborative research community. Making the attack code public could help researchers develop new defense techniques, promoting collective progress in AI security. However, it also carries the risk that malicious actors could use such information to develop more effective attacks. It will therefore be essential to adopt a balanced approach that allows scientific research to progress without compromising overall security.


The work on MRJ-Agent not only highlights the current vulnerability of LLMs but also underlines the importance of a proactive and adaptive approach to model security. It is necessary to further explore the interaction between attack and defense, seeking solutions that can evolve as rapidly as emerging threats. Only in this way can we ensure that these models continue to serve humanity safely and responsibly.


Conclusions

The emergence of technologies like MRJ-Agent highlights a crucial truth in the landscape of artificial intelligence: the interaction between attack and defense is not static but evolves as a complex and interdependent dynamic. The multi-round capabilities of this system reveal a critical point that is often overlooked: language models are not simply response tools but active participants in dialogues that reflect the complexity of human interactions. This consideration transforms security from an issue of static technical barriers into a fluid process that requires constant adaptation.


The risk decomposition and psychological induction introduced by MRJ-Agent are not just attack tactics but indicate a paradigm shift in how vulnerability is conceived. It is no longer an isolated model defect but a systemic flaw that emerges from the sum of interactions. This suggests that AI security must be redefined to address not only technical vulnerabilities but also cognitive and strategic manipulations. An effective security model cannot merely filter harmful requests; it must understand the sequence and context of the dialogue to detect insidious patterns that develop over time.


The idea of using an automated red-team like the πred model raises a strategic question: how sustainable is the current passive security approach? Companies implementing LLMs in critical contexts must adopt an offensive mindset in security, investing not only in defenses but also in continuous testing against simulated attacks. This concept, similar to a "preventive war" in the world of cybersecurity, could change the traditional approach, shifting from an exclusive focus on static protections to a model of iterative and dynamic learning.


Another fundamental aspect concerns the intersection between context and multimodal input. Attacks that combine text, images, and other modalities demonstrate how vulnerability is not confined to a single domain. This requires a convergence between model-specific defenses and a unified security framework capable of operating transversally. Companies developing multimodal systems must understand that the risk does not simply add up but amplifies: an initially harmless attack in one domain can be the key to exploiting weaknesses in another. This perspective requires a new generation of monitoring systems that can track the evolution of interactions across domains and modalities.


Finally, the research on MRJ-Agent highlights a crucial problem for AI ethics and alignment. The growing sophistication of multi-round attacks challenges the idea that AI can maintain stable alignment over time. The implications for companies are profound: it is not enough for a model to be safe at the time of release; it is necessary to ensure that it remains aligned throughout its entire operational life cycle. This suggests the need for self-correction mechanisms, supported by continuous and human feedback. But this also opens the door to a dilemma: how to balance the model's autonomy with human supervision without reducing operational efficiency?


Ultimately, the challenge posed by MRJ-Agent is not just about technological security but also touches on broader issues of governance, responsibility, and strategic design of AI systems. Companies must address these challenges not as isolated technical problems but as part of a broader transformation in risk management and building trust in artificial intelligence.


 
