Agent Goal Hijack Vulnerability in LLM
Description
AI agents can plan and carry out multiple steps on their own to reach a goal. The problem is that agents (and the models behind them) don’t reliably know the difference between real instructions and other content, such as text in a document, a webpage, an email, or a tool response.
Because of that, an attacker can steer the agent’s goal or decision path using things like hidden prompt text, misleading tool outputs, malicious files, forged agent-to-agent messages, or poisoned external data. Once the agent’s goal is influenced, every step that follows can be affected.
Impact
Agent Goal Hijack can lead to serious confidentiality, integrity, and availability problems, such as:
- Data leaks: The agent exposes emails, files, chat logs, etc.
- Unauthorized actions: The agent misuses the tools it can access.
- Fraud and financial loss: The agent performs unauthorized financial actions.
- Business harm: The agent generates misleading/fraudulent output that drives bad decisions.
- Abuse of trusted identity: The agent sends messages as a “trusted” internal user.
How bad it gets usually depends on what the agent is allowed to do and what systems/tools it can reach.
Scenarios
A company uses an agent connected to internal tools, including email, files, calendar, and a knowledge search system. An attacker sends a crafted email containing hidden instructions. When the agent reads it (even without the user clicking anything), it treats the hidden text as a real instruction and starts collecting sensitive internal information, then sends it out.
Or, the agent browses a webpage during a search/RAG flow. The page contains attacker-written content that looks like normal text but includes instructions aimed at the agent. The agent follows those instructions, accesses authenticated internal pages, and exposes private data.
Prevention
-
Treat all natural-language input as untrusted: Assume user text, documents, retrieved web content, emails, calendar invites, and tool outputs could contain malicious instructions.
-
Add prompt-injection safeguards before planning/tool use: Filter and validate content before it can affect goal selection, planning, or tool calls.
-
Use least privilege for tools: Give the agent only the minimum access it needs, and avoid extra capabilities.
-
Require approval for high-impact actions: Add human approval (or strong policy controls) for actions that change goals, access sensitive data, or trigger irreversible operations.
-
Lock and audit system prompts/goals: Make goal priorities and allowed actions explicit and reviewable. Treat goal changes like config changes (tracked and approved).
-
Validate intent at runtime: Before executing high-impact steps, confirm the action still matches the original user request and scope. Pause/block if the agent’s goal shifts unexpectedly.
-
Sanitize connected data sources: Clean and inspect RAG inputs, uploads, external APIs, browsing results, and agent-to-agent messages before they influence behavior.
-
Log and monitor agent behavior: Track goal state, tool usage patterns, and unusual action sequences. Alert on unexpected goal changes or abnormal tool workflows.
-
Red-team and test rollback: Regularly test goal-override attacks and verify you can detect, stop, and recover quickly.