
Human-Agent Trust Exploitation Vulnerability in LLM

Play AI LLM Labs on this vulnerability with SecureFlag!

  1. Human-Agent Trust Exploitation Vulnerability in LLM
    1. Description
    2. Impact
    3. Scenarios
    4. Prevention
    5. References

Description

Agents can sound confident, helpful, and emotionally aware, so people naturally begin to trust them as they would a human expert, a tendency known as anthropomorphism. Attackers, or simply unsafe designs, can exploit that trust to influence decisions, extract sensitive information, or push users into approving risky actions.

This is especially dangerous in agentic systems where users approve actions based on the agent’s recommendations or explanations without independent checks. A convincing “reason” can become a shortcut that bypasses oversight.

A key problem is that the agent can act as a “bad influence” while the human performs the final, audited action. That can make the agent’s role hard to detect later, because the logs only show that the user approved it. At its core, this vulnerability is about human overreliance on the agent and misperception of how trustworthy it really is.

Impact

Human-agent trust exploitation can lead to:

  • Credential theft: Users give secrets because the agent seems “legitimate”.
  • Fraud and financial loss: Users approve payments, refunds, or transfers based on misleading recommendations.
  • System compromise: Users run attacker-influenced commands, deploy unsafe changes, or paste malicious code into their systems.
  • Data breaches: Users share sensitive data due to persuasive requests.
  • Operational outages: Users approve destructive actions like deleting production data.
  • Reputational damage: Loss of trust when users realize the agent influenced harmful actions.

Scenarios

A finance copilot reads vendor invoices. An attacker gets a poisoned invoice into the system. The agent confidently recommends an urgent payment and provides a convincing explanation. The finance manager trusts the rationale and approves the transfer without independently verifying the bank details, sending money to the attacker.

In another scenario, a coding assistant suggests a “quick fix” command. The user trusts it because it looks professional and well explained, pastes it into a terminal, and unknowingly runs a script that exfiltrates source code or installs a backdoor.

Prevention

  • Explicit confirmations for risky actions: Require multi-step approvals for payments, deletions, privilege changes, publishing, or sensitive data access (a minimal approval-gate sketch follows this list).

  • Don’t rely on the model’s rationale: Model explanations should not be treated as safety guarantees. Provide simple, system-generated risk summaries and independent verification steps (see the risk-summary sketch after this list).

  • Strong provenance: Attach verifiable source information (where the data came from, timestamps, integrity checks) to recommendations, and block actions that lack trusted provenance (see the provenance sketch below).

  • Immutable audit logs: Keep tamper-evident logs of user requests, agent recommendations, and the exact actions approved, so manipulation is visible during review (see the hash-chained log sketch below).

  • Behavior monitoring: Detect patterns such as credential requests, repeated pressure for urgent actions, unusual approvals, or sensitive data being disclosed (see the monitoring sketch below).

  • Safe reporting path: Give users a clear “Report suspicious behavior” option and trigger automated review or temporary lockdown when flagged.

  • Adaptive trust controls: Increase oversight when risk is higher (e.g., low confidence, unverified source, high-impact action), and use clear UI cues like “unverified” / “low certainty” (see the oversight-level sketch below).

  • Preview must be read-only: Ensure “preview” screens can’t trigger side effects (no webhooks, no network calls, no state changes); see the read-only preview sketch below.

  • Human-factors safeguards: Avoid persuasive/emotional language in safety-critical flows. Train users on manipulation patterns and remind them that the agent can be wrong.

  • Detect workflow detours: Alert when agent actions deviate from approved baselines (skipped validation, unusual tool combinations, new sequences); see the detour-detection sketch below.
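
The sketches below illustrate several of the controls above. They are minimal, illustrative Python, not production implementations; the action categories, thresholds, names, and patterns they use are assumptions made for the examples.

A sketch of the approval gate for risky actions. The set of high-risk action kinds, the 10,000 amount threshold, and the AgentAction shape are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical categories this deployment treats as high risk.
HIGH_RISK_KINDS = {"payment", "delete_data", "privilege_change", "publish", "export_sensitive"}

@dataclass
class AgentAction:
    kind: str            # e.g. "payment"
    description: str     # system-generated summary shown to approvers
    amount: float = 0.0  # monetary value, if any

def requires_step_up(action: AgentAction) -> bool:
    """High-risk actions always need explicit, multi-step confirmation."""
    return action.kind in HIGH_RISK_KINDS or action.amount > 10_000

def is_approved(action: AgentAction, first_approver: str,
                second_approver: Optional[str]) -> bool:
    """Low-risk actions pass with one approval; high-risk ones need two distinct humans."""
    if not requires_step_up(action):
        return bool(first_approver)
    return bool(second_approver) and second_approver != first_approver
```

The point of the design is that the approval requirement is decided by the system from the action's metadata, not by anything the agent says.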
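
A sketch of a system-generated risk summary built from structured action metadata rather than from the model's own explanation; the fields and wording are assumptions.

```python
from typing import Optional

def system_risk_summary(action_kind: str, target: str,
                        amount: Optional[float], source_verified: bool) -> str:
    """Risk text is derived from metadata the system recorded itself,
    never from the agent's free-text rationale."""
    parts = [f"You are about to perform: {action_kind} on {target}."]
    if amount is not None:
        parts.append(f"Amount involved: {amount:,.2f}.")
    if not source_verified:
        parts.append("WARNING: the underlying data comes from an unverified source.")
    parts.append("Verify the details through an independent channel before approving.")
    return " ".join(parts)
```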
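
A sketch of attaching and checking provenance with an HMAC. The signing-key handling and the allow-list of trusted sources are placeholders; in practice the key would come from a secret store.

```python
import hashlib
import hmac
import time

SIGNING_KEY = b"replace-with-a-key-from-a-secret-store"      # assumption: managed secret
TRUSTED_SOURCES = {"erp.internal", "vendor-portal.example"}  # hypothetical allow-list

def attach_provenance(source: str, payload: bytes) -> dict:
    """Record where the data came from, when, and an integrity check over it."""
    digest = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"source": source, "fetched_at": time.time(), "hmac": digest}

def has_trusted_provenance(provenance: dict, payload: bytes) -> bool:
    """Refuse to act on recommendations whose inputs lack intact, trusted provenance."""
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return (provenance.get("source") in TRUSTED_SOURCES
            and hmac.compare_digest(provenance.get("hmac", ""), expected))
```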
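
A minimal sketch of a tamper-evident, hash-chained audit log. A production system would also write the entries to append-only or WORM storage.

```python
import hashlib
import json
import time

class AuditLog:
    """Each entry commits to the hash of the previous one, so later edits break the chain."""

    def __init__(self) -> None:
        self.entries: list[dict] = []
        self._last_hash = "0" * 64

    def record(self, user: str, agent_recommendation: str, approved_action: str) -> None:
        entry = {
            "ts": time.time(),
            "user": user,
            "recommendation": agent_recommendation,  # what the agent suggested
            "approved_action": approved_action,       # what the user actually approved
            "prev_hash": self._last_hash,
        }
        self._last_hash = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest()
        entry["hash"] = self._last_hash
        self.entries.append(entry)

    def verify(self) -> bool:
        """Recompute the chain; returns False if any entry was altered or reordered."""
        prev = "0" * 64
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if entry["prev_hash"] != prev or entry["hash"] != digest:
                return False
            prev = entry["hash"]
        return True
```

Recording both the recommendation and the approved action makes the agent's influence visible during later review, even though the human performed the final step.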
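
A sketch of simple heuristics for the behavior patterns listed above. The regular expressions and the session threshold are illustrative and would be tuned, or replaced with proper anomaly detection, in practice.

```python
import re
from typing import List

# Illustrative patterns only; a real deployment would tune these.
CREDENTIAL_REQUEST = re.compile(r"\b(password|api key|secret|one-time code|mfa code)\b", re.I)
URGENCY_PRESSURE = re.compile(r"\b(urgent|immediately|right now|cannot wait)\b", re.I)

def flag_agent_message(message: str, flags_this_session: int) -> List[str]:
    """Return reasons this agent message should pause the flow for human review."""
    reasons = []
    if CREDENTIAL_REQUEST.search(message):
        reasons.append("asks the user for credentials or secrets")
    if URGENCY_PRESSURE.search(message):
        reasons.append("applies time pressure to get an approval")
    if flags_this_session >= 3:
        reasons.append("repeated suspicious behavior in this session")
    return reasons
```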
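
A sketch of mapping risk signals to an oversight level and a UI cue. The scoring weights and thresholds are arbitrary placeholders.

```python
def oversight_level(model_confidence: float, source_verified: bool, high_impact: bool) -> str:
    """Translate risk signals into how much human oversight (and which UI cue) to apply.

    model_confidence is treated only as a hint in [0, 1], never as a safety guarantee.
    """
    score = 0
    if model_confidence < 0.5:
        score += 1   # show a "low certainty" cue
    if not source_verified:
        score += 2   # show an "unverified" cue
    if high_impact:
        score += 2
    if score >= 3:
        return "block-and-escalate"    # require a second approver and a review
    if score >= 1:
        return "step-up-confirmation"  # extra confirmation plus warning banner
    return "standard"
```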
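
A sketch of keeping previews side-effect free by construction: the preview renderer only formats data that has already been fetched, while anything that changes state lives behind a separate, explicitly approved execution step. The ProposedAction shape and the client.perform call are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProposedAction:
    kind: str
    params: dict

def render_preview(action: ProposedAction) -> str:
    """Pure function: no clients, no network, no writes; it can only produce text."""
    lines = [f"Action: {action.kind}"]
    lines += [f"  {key}: {value!r}" for key, value in sorted(action.params.items())]
    return "\n".join(lines)

def execute(action: ProposedAction, approved: bool, client) -> None:
    """All side effects are confined here and gated on an explicit approval."""
    if not approved:
        raise PermissionError("action has not been approved")
    client.perform(action.kind, action.params)  # hypothetical downstream client
```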
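
A sketch of comparing an agent's observed tool calls against an approved baseline. The workflow name and step names are hypothetical.

```python
from typing import Dict, List

# Hypothetical baseline: the steps an invoice-payment workflow must take, in order.
APPROVED_BASELINES: Dict[str, List[str]] = {
    "pay_invoice": ["fetch_invoice", "validate_vendor", "check_bank_details",
                    "request_approval", "execute_payment"],
}

def detect_detours(workflow: str, observed_steps: List[str]) -> List[str]:
    """Report skipped, reordered, or unexpected steps relative to the approved baseline."""
    baseline = APPROVED_BASELINES.get(workflow, [])
    alerts = []
    missing = [step for step in baseline if step not in observed_steps]
    if missing:
        alerts.append(f"skipped steps: {missing}")
    observed_in_baseline = [s for s in observed_steps if s in baseline]
    if observed_in_baseline != [s for s in baseline if s in observed_in_baseline]:
        alerts.append("steps executed out of the approved order")
    unexpected = [s for s in observed_steps if s not in baseline]
    if unexpected:
        alerts.append(f"unexpected tool calls: {unexpected}")
    return alerts
```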

References

OWASP - Top 10 for Agentic Applications

OWASP - Top 10 for LLMs