Onchain Agent Worst-Case Defense Design: If Your Agent Is Fully Compromised, How to Keep Losses Within an Acceptable Range

30-Second Version · For the impatient

'How do I prevent my Agent from being attacked?' is the wrong question. The right question is: 'If all of the Agent's defenses fail, what's the worst an attacker can do?' If the answer is 'take all my assets,' security design isn't complete. The correct answer should be: 'A few days of working capital from the operations wallet, with complete logs enabling post-incident root cause tracing.'

Jordan Blake · June 23, 2026

The security design question most people ask about crypto AI Agents is 'how do I prevent my Agent from being attacked?' That's the wrong question. The right question is: 'If all of the Agent's defenses fail — Prompt Injection succeeds, MCP Server is poisoned, LLM reasoning is fully hijacked — what's the worst an attacker can do?' If you can't answer this question clearly, your Agent security design isn't complete, no matter how well your System Prompt is written. This article starts from 'worst case' — the design goal isn't to make attacks impossible, but to keep the consequences of a successful attack within your acceptable range.

Asking the Right Question Produces Good Defense Design

Traditional security design follows a 'lock the doors' mindset: add more security checks and make attacks harder to pull off. That isn't enough for crypto Agents, because a successful attack is always possible: no defense against LLM Prompt Injection is 100% effective, social engineering through MCP Servers is hard to fully prevent, and you can't guarantee that every tool vendor you rely on hasn't been compromised. Crypto Agent security design should instead start from a 'Defense in Depth' architecture: assume some defense layers will fail, make sure another layer stands behind each one, and make sure the worst consequence of any single layer failing is acceptable. Specifically, for every Agent that handles funds, you should be able to answer four questions. What is the most money the Agent operations wallet ever holds, and could you accept losing all of it? Is there any way the Agent can transfer to addresses outside the whitelist without your knowledge? If the Agent's LLM reasoning is fully compromised, what is the maximum amount it is authorized to move? Can your logging system let you trace the attack path after the fact?

First Line of Defense: Minimum Necessary Operational Authorization

Minimum necessary authorization is the foundation of the entire defense system: even if every other defense fails, this layer sets the ceiling on what attackers can get. Design principles:

The Agent operations wallet holds only a few days of working capital: an amount whose complete loss wouldn't cause you financial stress. The Agent only needs enough 'fuel' to execute a few operations; most funds stay in the primary wallet, which the Agent cannot directly access.

ERC-20 approvals are precisely limited: set a specific maximum approve amount per protocol and token, and never grant unlimited authorization. Review approvals monthly and revoke the unused ones. Even if the Agent's LLM is compromised, the amount it can move through a protocol is hard-capped by the approve limit.

Operation types are whitelisted: the Agent's callable tools must be limited to the minimum set its task requires. Remove unnecessary tools from the tool list; don't leave them there 'just in case.'
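A minimal sketch of how the approval caps and the tool whitelist could be enforced in backend code. The protocol names, cap values, and tool names (`dex_router`, `lending_pool`, `get_price`, and so on) are hypothetical placeholders, not part of any real library; amounts are assumed to be in token base units.

```python
# ERC-20 "unlimited" approvals are conventionally the maximum uint256 value.
MAX_UINT256 = 2**256 - 1

# Hypothetical per-(protocol, token) caps, in token base units.
APPROVAL_CAPS = {
    ("dex_router", "USDC"): 500_000_000,    # 500 USDC at 6 decimals
    ("lending_pool", "USDC"): 200_000_000,  # 200 USDC at 6 decimals
}

# Hypothetical minimum tool set for an Agent that only swaps.
ALLOWED_TOOLS = frozenset({"get_price", "get_balance", "swap"})


def check_approval(protocol: str, token: str, amount: int) -> int:
    """Return the amount if it respects the configured cap, else raise ValueError."""
    if amount == MAX_UINT256:
        raise ValueError("unlimited approvals are not allowed")
    cap = APPROVAL_CAPS.get((protocol, token))
    if cap is None:
        raise ValueError(f"no approval cap configured for {protocol}/{token}")
    if amount < 0 or amount > cap:
        raise ValueError(f"amount {amount} is outside the cap {cap}")
    return amount


def filter_tools(registered: dict) -> dict:
    """Expose only whitelisted tools to the Agent; everything else is dropped."""
    return {name: fn for name, fn in registered.items() if name in ALLOWED_TOOLS}
```

Rejecting any pair without a configured cap, rather than falling back to a default, keeps the policy deny-by-default: a new protocol gets no approval until someone deliberately adds a cap for it.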

Second Line of Defense: Read/Write Isolation and Independent Confirmation Channel

Even if attackers successfully contaminate the Agent's LLM reasoning, read/write isolation ensures that contaminated reasoning cannot directly trigger fund operations.

Read tools and write tools execute in strictly isolated environments. Read tools may run after contact with any external data (worst case: they read wrong information); write tools run only in a 'clean' execution environment that never touches unvalidated external data. LangGraph's graph design makes this isolation natural: read tools and write tools sit in different graph nodes.

All write operations get a second layer of parameter validation in the backend. Inside the tool function's backend implementation (not the description layer the LLM sees), hard-validate every parameter of every write operation: the amount is within limits, the target address or protocol is in the whitelist, and the operation type is permitted. These validation rules live in Python/JavaScript code, not in the System Prompt. A System Prompt can be overridden by Prompt Injection; code-level validation cannot.

High-value operations go through an independent confirmation channel. Any write operation above your threshold (e.g., $100) requires confirmation via a channel completely independent of the Agent's LLM reasoning flow: a Telegram Bot notifies you, waits for your reply, and only then executes. Even if all of the Agent's LLM reasoning is compromised, attackers cannot bypass this step, because the confirmation request goes to your phone, not to the LLM.
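The backend validation and the confirmation gate can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the whitelist address, the limits, and the names `WriteRequest`, `validate_write`, and `execute_write` are hypothetical, and `confirm` stands in for whatever out-of-band channel you use (for example, a Telegram Bot round trip).

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical policy values; in practice these live in backend config.
ALLOWED_OPERATIONS = frozenset({"swap", "transfer"})
ADDRESS_WHITELIST = frozenset({"0x1111111111111111111111111111111111111111"})
MAX_AMOUNT_USD = 500.0
CONFIRM_THRESHOLD_USD = 100.0  # the article's example threshold


@dataclass(frozen=True)
class WriteRequest:
    operation: str
    target: str
    amount_usd: float


class ValidationError(Exception):
    """Raised when a write request violates the hard-coded policy."""


def validate_write(req: WriteRequest) -> None:
    """Hard checks that run in code, outside anything the LLM can rewrite."""
    if req.operation not in ALLOWED_OPERATIONS:
        raise ValidationError(f"operation not permitted: {req.operation}")
    if req.target.lower() not in ADDRESS_WHITELIST:
        raise ValidationError(f"target not whitelisted: {req.target}")
    if not (0 < req.amount_usd <= MAX_AMOUNT_USD):
        raise ValidationError(f"amount out of range: {req.amount_usd}")


def execute_write(
    req: WriteRequest,
    confirm: Callable[[WriteRequest], bool],
    send: Callable[[WriteRequest], None],
) -> str:
    """Validate, ask for human confirmation above the threshold, then send."""
    validate_write(req)
    if req.amount_usd > CONFIRM_THRESHOLD_USD and not confirm(req):
        return "rejected"
    send(req)
    return "executed"
```

Because `confirm` is passed in by the backend rather than chosen by the LLM, a hijacked reasoning loop has no way to swap it for a function that always answers yes.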

Third Line of Defense: Circuit-Breakers and Complete Logs

Circuit-breakers start from the assumption that the Agent is already doing something abnormal, and ask how to automatically stop the losses from growing.

Daily spend limit circuit-breaker: maintain a daily cumulative spend counter in backend logic (Gas fees + A2A payment fees + fund operation amounts). Exceeding the daily limit automatically pauses all write operations, sends an emergency notification, and waits for your manual reset. This counter lives in backend code, not in the LLM's Context, so the LLM cannot read or modify it.

Market anomaly circuit-breaker: define market anomaly conditions (assets drop more than X% in 15 minutes, Gas exceeds 10x normal, DEX slippage exceeds a set limit). Any trigger automatically pauses write operations, which prevents the Agent from executing strategies designed for normal markets during black swan events.

Complete four-layer operation logs: LLM reasoning logs, tool call logs, decision authorization logs, and on-chain execution logs. Store them encrypted and retain them for at least 90 days.
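A minimal sketch of the two circuit-breakers described above, assuming all spend is tracked in USD and that the caller supplies the current date and market readings. The class and function names (`DailySpendBreaker`, `market_anomaly`) and the parameter names are illustrative assumptions; only the 10x Gas multiple comes from the text.

```python
from datetime import date
from typing import Callable, Optional


class DailySpendBreaker:
    """Backend-side daily spend counter; the LLM never sees or edits it."""

    def __init__(self, daily_limit_usd: float, notify: Callable[[str], None]):
        self.daily_limit_usd = daily_limit_usd
        self.notify = notify  # e.g. sends the emergency notification
        self.paused = False
        self._day: Optional[date] = None
        self._spent_usd = 0.0

    def try_spend(self, amount_usd: float, today: date) -> bool:
        """Record gas, fees, or fund movement; return False if writes are paused."""
        if amount_usd < 0:
            raise ValueError("spend amounts must be non-negative")
        if today != self._day:
            # New day: the counter resets, but a tripped breaker stays tripped.
            self._day, self._spent_usd = today, 0.0
        if self.paused:
            return False
        if self._spent_usd + amount_usd > self.daily_limit_usd:
            self.paused = True
            self.notify(f"daily limit of {self.daily_limit_usd} USD hit; writes paused")
            return False
        self._spent_usd += amount_usd
        return True

    def manual_reset(self) -> None:
        """Only a human operator should call this."""
        self.paused = False


def market_anomaly(
    drop_pct_15m: float,
    gas_vs_normal: float,
    slippage_pct: float,
    *,
    max_drop_pct: float,
    max_slippage_pct: float,
    max_gas_multiple: float = 10.0,
) -> bool:
    """True if any single anomaly condition from the text is met."""
    return (
        drop_pct_15m > max_drop_pct
        or gas_vs_normal > max_gas_multiple
        or slippage_pct > max_slippage_pct
    )
```

Note the design choice in `try_spend`: the pause survives the midnight rollover. The counter resets each day, but only `manual_reset` reopens writes, matching the "waits for your manual reset" requirement.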

Emergency Response Flow After Attack

Post-attack emergency response must be designed in advance. Standard emergency response flow:

Step 1 (0–5 minutes): confirm the Agent is executing unexpected operations and immediately stop the Agent service process. This is the fastest 'emergency brake' and prevents new transactions from being broadcast.

Step 2 (5–15 minutes): revoke all ERC-20 approvals from the Agent operations address to all protocols. Even with the service stopped, any remaining approvals let an attacker who controls an approved spender call the token contracts and transfer tokens directly. An approval can only be changed by the address that granted it, so for each token contract, call `approve(spender, 0)` from the Agent operations address for every approved spender. If your primary wallet ever approved the Agent address as a spender, also call `approve(agentAddress, 0)` from the primary wallet.

Step 3 (15–60 minutes): save all logs to isolated storage (to prevent log clearing) and begin root cause analysis: identify the Thought step at which the Agent first showed anomalies, which tool returned anomalous data, and the attack entry point.

Step 4: fix the security vulnerabilities, fully replay the attack path on a testnet to confirm the fix works, then redeploy.
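The revocation in Step 2 comes down to one ERC-20 `approve` call per (token, spender) pair with the amount set to 0. As a hedged sketch, the calldata for those calls can be prepared ahead of time with the standard library alone, so the list is ready before an incident. The function names `revoke_calldata` and `revocation_plan` are hypothetical; signing and broadcasting are deliberately out of scope and must be done by the wallet that granted each approval.

```python
from typing import Iterable, List, Tuple

# First 4 bytes of keccak256("approve(address,uint256)"), the ERC-20 selector.
APPROVE_SELECTOR = "095ea7b3"
HEX_DIGITS = set("0123456789abcdef")


def revoke_calldata(spender: str) -> str:
    """ABI-encode approve(spender, 0) as a 0x-prefixed hex string."""
    addr = spender[2:] if spender[:2] in ("0x", "0X") else spender
    addr = addr.lower()
    if len(addr) != 40 or not set(addr) <= HEX_DIGITS:
        raise ValueError("spender must be a 20-byte hex address")
    # Each argument is left-padded to 32 bytes (64 hex characters).
    padded_spender = addr.rjust(64, "0")
    zero_amount = "0" * 64
    return "0x" + APPROVE_SELECTOR + padded_spender + zero_amount


def revocation_plan(approvals: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Map (token_contract, spender) pairs to (to_address, calldata) pairs."""
    return [(token, revoke_calldata(spender)) for token, spender in approvals]
```

Keeping an up-to-date list of (token, spender) pairs alongside the monthly approval review means this plan can be generated in seconds during the 5–15 minute window instead of being reconstructed from on-chain history under pressure.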

What This Means for Your Money

The goal of defense in depth isn't to make the Agent system 'impossible to attack' — that's unrealistic. The realistic goal is: if an attack succeeds, the attacker can get at most a few days of working capital from your operations wallet (not all your assets), and you have complete logs to trace the attack path and root cause after the fact. A well-designed Onchain Agent system should let you confidently answer: 'If my Agent is fully compromised today, I can still run this business tomorrow.' If you can't answer that way, security design isn't complete.


