System Prompt Leakage Vulnerability in LLMs
Description
In LLM-based systems, developers use system prompts to guide model behavior according to application requirements. These prompts often define how the LLM should respond to users and what boundaries it should respect during interactions.
System Prompt Leakage occurs when system-level instructions, sometimes containing sensitive data or decision logic, are exposed to end users or attackers. While system prompts are not meant to be secret, they should never include confidential data such as credentials, connection strings, or internal role structures. Once a prompt is exposed, attackers can extract valuable information, reverse-engineer guardrails, or launch further attacks such as prompt injection or privilege escalation.
The real risk is not the prompt’s wording itself, but the underlying security lapses it reveals, such as improper separation of duties or reliance on LLMs to enforce critical controls.
Impact
System Prompt Leakage can expose sensitive details about how an application functions, including internal logic, access controls, or even credentials. This can lead to unauthorized access, unintended behavior changes, or circumvention of established restrictions. Attackers may use this information to craft more effective prompt injections, escalate privileges, or bypass security controls entirely.
The severity of the impact depends on what is revealed and how heavily the application relies on the LLM and its system prompt to enforce security controls.
Scenarios
A customer support chatbot uses an LLM backed by agent-based workflows. The system prompt contains internal logic like:
Users with tier ID GOLD_42A are routed to agent:PremiumHandler. Others use agent:StandardHandler.
An attacker uses prompt injection to reveal the system prompt, gaining access to internal user tiers, agent names, and response formats. They exploit this to impersonate high-tier users or manipulate agent routing, leading to privilege escalation or unauthorized service access.
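To make the contrast concrete, the sketch below (the tier table, handler names, and prompt text are hypothetical, not taken from a specific application) shows why routing rules embedded in the system prompt are exposed by any successful extraction, while the same decision made in application code never reaches the model:

```python
# Illustrative sketch only: the tier table, handler names, and prompt text are
# hypothetical and not taken from a specific application.

# Leaky pattern: routing rules and tier identifiers live inside the system prompt,
# so a successful prompt-injection attack can expose them verbatim.
LEAKY_SYSTEM_PROMPT = (
    "You are a support assistant. Users with tier ID GOLD_42A are routed to "
    "agent:PremiumHandler. Others use agent:StandardHandler."
)

# Safer pattern: the prompt carries no routing logic, so leaking it reveals
# nothing about tiers or internal agent names.
SAFE_SYSTEM_PROMPT = "You are a support assistant. Answer the user's question."

USER_TIERS = {"alice": "premium", "bob": "standard"}  # held server-side

def route_request(user_id: str, message: str) -> tuple[str, str]:
    """Resolve the handler in application code, outside the LLM."""
    tier = USER_TIERS.get(user_id, "standard")
    handler = "PremiumHandler" if tier == "premium" else "StandardHandler"
    # The chosen handler receives only the neutral prompt plus the user message.
    return handler, f"{SAFE_SYSTEM_PROMPT}\n\nUser: {message}"

print(route_request("alice", "I need help with my invoice."))
```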
In another case, a system prompt contains an API key used by the LLM to connect to a third-party service. Once the prompt is exposed, the key is stolen and used by the attacker to access services outside the intended scope.
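A minimal sketch of the safer alternative, assuming a hypothetical billing service and a BILLING_API_KEY environment variable: the key lives only in the tool implementation, so even a fully leaked system prompt reveals nothing reusable.

```python
# Hypothetical sketch: the environment variable name, tool name, and service URL
# are illustrative placeholders, not a real integration.
import os
import urllib.request

# Anti-pattern: the key leaks together with the prompt.
LEAKY_SYSTEM_PROMPT = "Use API key sk-live-... when calling the billing service."

# Safer pattern: the model only learns that a tool exists; the key is read from
# a secret store (here, an environment variable) inside the tool implementation.
SYSTEM_PROMPT = "You may call the get_invoice tool to look up billing data."

def get_invoice(invoice_id: str) -> bytes:
    api_key = os.environ["BILLING_API_KEY"]    # never placed in the prompt
    req = urllib.request.Request(
        f"https://billing.example.com/invoices/{invoice_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:  # executed by the app, not the LLM
        return resp.read()
```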
These examples demonstrate how system prompt leakage can compromise confidentiality and lead to downstream exploitation.
Prevention
- Separate sensitive data from system prompts: Never embed sensitive information (e.g., credentials, database names, or permission structures) within the system prompt. Such data should be stored and accessed securely by other components of the application.
- Avoid reliance on system prompts for strict behavior control: Don’t depend on system prompts alone to enforce critical behaviors or restrictions. Use external systems to handle security-sensitive checks such as content moderation, access validation, or rate limits.
- Implement guardrails: Establish guardrails and enforcement layers outside the LLM to monitor and control its behavior. These may include middleware checks, output validation, or response filtering based on application logic (see the output-filtering sketch after this list).
- Ensure security controls are enforced independently of the LLM: Don’t rely on the LLM to enforce authentication, authorization, or role-based access. Handle those controls in separate, auditable systems, and give each agent only the permissions it needs for its task category (see the authorization sketch after this list).
- Logging and monitoring: Continuously monitor the LLM’s behavior and its system prompt interactions. Logging prompt exposures and reviewing model outputs can help detect security violations and suspicious access patterns.
- Rate limiting: Apply rate limits to reduce the likelihood of system prompt probing or enumeration attacks. Slowing repeated access attempts provides time to detect and mitigate leakage before it causes significant damage (a minimal rate-limiter sketch follows below).
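The following sketches illustrate some of these measures. First, a minimal output-filtering guardrail (the similarity threshold and logger setup are assumptions, not a specific product) that blocks and logs responses reproducing large parts of the system prompt:

```python
# Minimal output-filtering guardrail: the threshold and logging setup are
# illustrative assumptions, not taken from a specific framework.
import logging
from difflib import SequenceMatcher

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("prompt-leakage-guard")

SYSTEM_PROMPT = "You are a support assistant. Never discuss internal tooling."

def filter_response(model_output: str, threshold: float = 0.6) -> str:
    """Block responses that reproduce a large portion of the system prompt."""
    similarity = SequenceMatcher(None, SYSTEM_PROMPT.lower(),
                                 model_output.lower()).ratio()
    if similarity >= threshold:
        # Record the suspected leak so monitoring can surface probing patterns.
        logger.warning("Possible system prompt leakage (similarity=%.2f)", similarity)
        return "Sorry, I can't share that."
    return model_output

print(filter_response("Your invoice is attached."))  # passes through
print(filter_response(SYSTEM_PROMPT))                # blocked and logged
```

A real deployment would pair this with semantic or policy-based checks, since plain string similarity misses paraphrased leaks.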
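Next, a sketch of authorization enforced outside the LLM, using assumed role and action names; the model’s requested action is treated as untrusted input and is denied unless an application-side policy allows it:

```python
# Hypothetical sketch: role names, actions, and the permission table are
# illustrative; a real system would back this with an auditable policy store.
ROLE_PERMISSIONS = {
    "support_agent": {"read_ticket", "reply_ticket"},
    "billing_agent": {"read_invoice"},
}

def execute_tool(role: str, action: str, payload: dict) -> str:
    """Authorize in application code; the LLM's output is treated as untrusted."""
    allowed = ROLE_PERMISSIONS.get(role, set())
    if action not in allowed:
        # Deny by default, regardless of what the model or its prompt claims.
        raise PermissionError(f"role {role!r} may not perform {action!r}")
    return f"executed {action} with {payload}"

# Even if the LLM is tricked into requesting a privileged action, the check
# above decides whether it actually runs.
print(execute_tool("support_agent", "read_ticket", {"id": 42}))
```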
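Finally, a minimal in-memory sliding-window rate limiter (the limit, window, and storage choice are illustrative; shared storage such as a cache would be typical in production) to slow repeated prompt-probing attempts:

```python
# Minimal sliding-window rate limiter; the limit, window, and in-memory storage
# are assumptions for illustration, not a production design.
import time
from collections import defaultdict, deque

REQUEST_LOG: dict[str, deque] = defaultdict(deque)
MAX_REQUESTS = 20        # allowed requests per window
WINDOW_SECONDS = 60.0

def allow_request(client_id: str, now: float | None = None) -> bool:
    """Return True if the client is within its rate limit."""
    now = time.monotonic() if now is None else now
    window = REQUEST_LOG[client_id]
    # Drop timestamps that fell out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False     # reject; repeated probing is slowed and can be flagged
    window.append(now)
    return True

print(allow_request("203.0.113.7"))  # True until the limit is reached
```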