Unbounded Consumption Vulnerability in LLMs
Description
Because every request to a large language model (LLM) triggers compute-intensive inference, attackers can interact with the model in ways that drive up resource consumption. This can result in degraded service quality, reduced availability, and delayed responses for both the attacker and legitimate users. These attacks typically exploit the LLM’s inference process, the stage at which the model generates responses from its learned parameters.
More advanced attacks target the model’s context window, the maximum amount of text it can process at once. By sending continuous or oversized inputs, attackers drive unbounded consumption of computational resources. This can escalate into denial of service (DoS), unexpected costs from overuse, or even attempts to replicate the model’s behavior and functionality.
Impact
Denial of Service (DoS) attacks and other resource-heavy exploits can seriously slow down or even take LLM applications offline, making them unusable for end users. In cloud environments where compute usage drives costs, these kinds of attacks can quickly result in unexpected bills. They also put extra strain on infrastructure, forcing service providers to spend more on defenses and recovery efforts.
Beyond performance and financial implications, uncontrolled consumption can cause intellectual property loss, especially when attackers use APIs to replicate model behavior or functionality. Additionally, such attacks can compromise service reputation and user trust.
Scenarios
Consider an LLM-powered travel assistant deployed at an airport to deliver real-time flight updates to passengers, including delays, cancellations, and boarding notices.
In this setting, an attacker submits an unusually large and complex request or floods the system with repeated inputs, overloading the assistant. This prevents the model from processing real-time updates promptly. As a result, customers might receive critical information too late, leading to missed flights and disruptions.
Other scenarios include:
- A malicious actor sending a continuous stream of inputs that push the model’s memory and context window limits, eventually causing operational failures.
- An attacker exploiting cloud-based pay-per-use inference to trigger large-scale operations, driving up costs in a Denial of Wallet (DoW) attack.
- Functional model replication, where an attacker systematically collects outputs and trains a parallel model, duplicating capabilities without proper authorization.
Prevention
- API rate limiting: Apply request limits based on user sessions or IP addresses within a defined time window to prevent flooding and overuse of the system (see the rate-limiting and timeout sketch after this list).
- Sanitization and validation: Use filters and validation techniques to reject malformed or malicious inputs before they reach the model (see the input-sanitization sketch after this list).
- Step- and query-based resource caps: For complex queries answered in steps or stages, cap the resources allotted to each step and pace generation so a single request cannot exhaust compute.
- Monitoring: Watch for usage spikes that may indicate DoS or other malicious activity, and make developers aware of the methods commonly used to target LLMs with DoS attacks.
- Timeouts and throttling: Introduce timeouts for long-running operations and throttle response rates to restrict overuse of computational resources.
- Access control and sandboxing: Enforce strong access policies and isolate the LLM from external networks and internal services to reduce attack surfaces.
- Watermarking and logging: Embed watermarks in model outputs and maintain comprehensive logs to trace usage patterns and detect unauthorized model replication.
- Graceful degradation: Design systems to reduce functionality or response complexity under high load instead of failing entirely (a small degradation sketch follows this list).
- Glitch token filtering: Identify and filter out known glitch tokens or harmful patterns before the model processes them.
- Secure deployment workflows: Use MLOps practices with governance and approval workflows to prevent unauthorized changes in production environments.
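As a rough illustration of the rate-limiting, timeout, and output-cap controls above, the following Python sketch wraps an LLM endpoint in a FastAPI service. It is a minimal example under stated assumptions: `call_llm` is a hypothetical stand-in for the real model backend, and the window sizes, request budgets, and token caps are placeholder values, not recommendations.

```python
import asyncio
import time
from collections import defaultdict, deque

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()

WINDOW_SECONDS = 60             # sliding window for rate limiting (placeholder)
MAX_REQUESTS_PER_WINDOW = 20    # per-client request budget (placeholder)
MAX_OUTPUT_TOKENS = 512         # cap on generated tokens per request (placeholder)
INFERENCE_TIMEOUT_SECONDS = 30  # cut off long-running generations (placeholder)

_request_log: dict[str, deque] = defaultdict(deque)


def allow_request(client_id: str) -> bool:
    """Sliding-window rate limiter keyed by client IP or session ID."""
    now = time.monotonic()
    window = _request_log[client_id]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS_PER_WINDOW:
        return False
    window.append(now)
    return True


async def call_llm(prompt: str, max_tokens: int) -> str:
    """Hypothetical stand-in for the real model backend call."""
    await asyncio.sleep(0.1)  # simulate inference latency
    return f"(model reply, capped at {max_tokens} tokens)"


@app.post("/chat")
async def chat(request: Request):
    client_id = request.client.host if request.client else "unknown"
    if not allow_request(client_id):
        raise HTTPException(status_code=429, detail="Rate limit exceeded")

    body = await request.json()
    prompt = body.get("prompt", "")

    try:
        # The timeout stops a single request from holding compute indefinitely.
        reply = await asyncio.wait_for(
            call_llm(prompt, max_tokens=MAX_OUTPUT_TOKENS),
            timeout=INFERENCE_TIMEOUT_SECONDS,
        )
    except asyncio.TimeoutError:
        raise HTTPException(status_code=504, detail="Generation timed out")

    return {"reply": reply}
```

Keeping these controls at the API layer means they apply uniformly regardless of which model serves the request.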
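Input sanitization and glitch-token filtering can be applied before a request ever reaches the model. The sketch below is illustrative only: the length limit is a placeholder to be tuned to the model’s context window, and the glitch-token list contains example entries rather than an authoritative set, since problematic tokens depend on the specific tokenizer.

```python
MAX_PROMPT_CHARS = 8_000  # placeholder limit; tune to the model's context window
# Illustrative glitch tokens only; the real list depends on the model's tokenizer.
KNOWN_GLITCH_TOKENS = {" SolidGoldMagikarp", " petertodd"}


def sanitize_prompt(prompt: str) -> str:
    """Reject oversized inputs and strip known glitch tokens before inference."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("Prompt exceeds the maximum allowed length")
    for token in KNOWN_GLITCH_TOKENS:
        prompt = prompt.replace(token, "")
    return prompt


# Example usage: validate user input before forwarding it to the model backend.
safe_prompt = sanitize_prompt("What is the status of flight BA117?")
```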
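Graceful degradation can be as simple as shrinking the per-request generation budget as load rises, so the service keeps answering rather than failing outright. The sketch below assumes a hypothetical 0.0–1.0 utilization signal supplied by the serving infrastructure; the thresholds and budgets are arbitrary examples.

```python
def degraded_token_budget(base_budget: int, load: float) -> int:
    """Reduce the per-request token budget as system load rises, instead of failing.

    `load` is assumed to be a 0.0-1.0 utilization signal from the serving stack.
    """
    if load < 0.7:
        return base_budget                # normal operation
    if load < 0.9:
        return max(base_budget // 2, 64)  # shorter responses under pressure
    return max(base_budget // 4, 32)      # minimal responses near saturation
```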