BSI: Guide to Avoiding Evasion Attacks on LLMs
David Hussain 4 Minuten Lesezeit

BSI: Guide to Avoiding Evasion Attacks on LLMs

Large Language Models (LLMs) have become established in many areas from customer support to software development, but they also bring new security risks. A growing and subtle threat is posed by so-called Evasion Attacks. In these attacks, adversaries attempt to manipulate the model during operation to provoke undesirable or dangerous behaviors. In the literature, these attacks are often referred to as (indirect) Prompt Injections, Jailbreaks, or Adversarial Attacks.
evasion-attacks large-language-models prompt-injections jailbreaks adversarial-attacks security-risks bsis-guidelines

Evasion Attacks on LLMs: A BSI Guide to Defending Against Prompt Injections and Jailbreaks

Large Language Models (LLMs) have become established in many areas from customer support to software development, but they also bring new security risks. A growing and subtle threat is posed by so-called Evasion Attacks. In these attacks, adversaries attempt to manipulate the model during operation to provoke undesirable or dangerous behaviors. In the literature, these attacks are often referred to as (indirect) Prompt Injections, Jailbreaks, or Adversarial Attacks.

The main issue: LLMs are designed to flexibly respond to a wide range of inputs, which increases their attack surface. The goal of a successful attack is to bypass security restrictions, which can lead to the generation of malicious content, exfiltration of sensitive data, or system disruption.


1. Attack Methods: Coherent vs. Incoherent

Evasion Attacks can be categorized based on their mechanism into two main categories:

Attack Category Description Examples
Coherent Text Uses semantically and syntactically correct instructions to directly or indirectly push the LLM out of its role. Naive Attack, Context-Ignoring Attack, Role Play Attack, Multi-Turn Manipulation (gradual influence over multiple interactions).
Incoherent Text Utilizes strings or arbitrary compositions incomprehensible to humans to achieve unpredictable or targeted behavior. Escape Character Attacks, Obfuscation Attack (e.g., Base64 encoding), Adversarial Suffix Attack (appending seemingly random but deliberately crafted strings).

Hiding the Attack: Attackers also use Attack Steganography to conceal their malicious instructions, for example by:

  • Invisible font color.

    \

  • Hiding in metadata or logs.

    \

  • Using archive formats (ZIP/RAR).

    \

Attacks can enter the LLM system through various entry points, such as the user prompt, user data, logs, or accessible databases.


2. Practical Countermeasures for Secure LLM Systems

The Federal Office for Information Security (BSI) recommends countermeasures that should be integrated into the LLM system architecture on four hierarchical levels:

Level 1: System and LLM Level (Technical Core Protection)

This level focuses on directly hardening the language model and processing user inputs:

Measure (Abbr.) Description and Purpose
Guardrails & Filtering Checking LLM inputs and outputs to detect and block malicious content early. Should occur before and after LLM processing.
Secure Prompt Techniques (SPTE) Use of Structured Prompts (SP) or Delimiter-based Isolation (DBI) – clear separation of system instructions and user data (e.g., through special tokens or XML tags).
Model Alignment (MFT) Adjusting the LLM to make it more resilient, e.g., through Adversarial Training (AT) or Reinforcement Learning from Human Feedback (RLHF).

Level 2: Data and Execution Level (Integrity and Isolation)

Measures are taken here to protect the execution environment and limit harmful effects:

  • Sandboxing (SB): Isolating system processes to prevent a successful attack on one component from affecting the entire system or gaining access to critical resources.
  • Least Privilege Principle (LPP): The LLM is granted only the minimal necessary permissions to perform its tasks.

Level 3: External Interaction Level (Architectural Security)

This level involves the secure design of interfaces and the overall system:

  • Secure Design Patterns: Implementing structural strategies like the “Dual LLM” pattern (separating a privileged LLM from one quarantined for processing untrusted data) or the “Plan-then-Execute” pattern (breaking down complex tasks into verifiable sub-steps).
  • MAPM (Model Action Privilege Minimization): Specifically limiting the actions the LLM can trigger to the absolute minimum necessary for the use case.

Level 4: Organizational and Management Level (Policies and Processes)

This level governs the governance and organizational measures in dealing with LLMs:

  • Labels and Reasoning (LR): Ensuring that data generated by the LLM is clearly labeled (e.g., through watermarking). Additionally, the data basis for decisions should be transparent and traceable.

Incident Response: Developing clear processes for handling detected security incidents and successful Evasion Attacks.

3. Conclusion: The Path to System Hardening

Currently, there is no single “Bullet-Proof” solution for completely defending against Evasion Attacks. Market leaders rely on a multi-layered approach (Defense-in-Depth).

Developers and IT security officers should approach the topic through a systematic risk analysis that results in a checklist. The BSI suggests a Baseline Security Approach as a starting point. This includes measures such as: MAPM (Model Action Privilege Minimization), LR (Labels and Reasoning of Data and Action), and SP (Structured Prompts).

The selection of appropriate countermeasures ultimately depends on the specific use case, available resources, and accepted risk assessment.

Source: https://www.bsi.bund.de/DE/Service-Navi/Presse/Alle-Meldungen-News/Meldungen/Evasion-Attacks-LLM_251110.html

BSI Checklist:

https://www.bsi.bund.de/SharedDocs/Downloads/EN/BSI/KI/Evasion_Attacks_on_LLMs-Checklist.pdf?__blob=publicationFile&v=4

container devops cloud-native

Ähnliche Artikel