What key metrics should Agent monitoring track? How should alert thresholds be set for each metric?
Agent monitoring metrics have three layers, each with different key metrics and alerting logic:
LLM Reasoning Layer Metrics:
Tool Execution Layer Metrics:
On-Chain Operations Layer Metrics:
General threshold-setting principle: derive thresholds from the Agent's normal operating data on testnet (the baseline) rather than setting them arbitrarily. Run the Agent on testnet for 48-72 hours and record the normal range of every metric; then set production alert thresholds at 1.5-2× that normal range.
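The baseline rule can be sketched as a small helper. This is a minimal illustration, not a definitive implementation: it assumes a "higher is worse" metric (such as API latency or Gas per transaction) and takes the largest value seen during the testnet run as the top of the normal range. The function name alert_threshold and the choice of max() are assumptions for illustration.

```python
def alert_threshold(baseline_samples, multiplier=1.5):
    """Return a production alert threshold from testnet baseline samples.

    Assumption: the metric is "higher is worse", and the top of the normal
    range is the largest value observed during the 48-72 hour testnet run.
    """
    if not baseline_samples:
        raise ValueError("need at least one baseline sample")
    if not 1.5 <= multiplier <= 2.0:
        raise ValueError("multiplier should stay within the 1.5-2x guideline")
    return multiplier * max(baseline_samples)


# Example: API latency samples in milliseconds collected on testnet.
latency_ms = [100, 120, 110]
print(alert_threshold(latency_ms, multiplier=2.0))
```

Using max() makes the threshold sensitive to a single outlier in the baseline; a high percentile may be a better choice if the testnet run contains spikes. Metrics where lower is worse (such as tool call success rate) need the mirrored rule.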
What actions should be taken after an alert? How should alert severity levels be designed?
Alert severity levels match the response speed and the action taken to the seriousness of the problem. A four-level design is recommended:
P0 (Immediate action, minute-level response): trigger conditions include any operation involving a non-whitelist address, the Validation Log showing BLOCKED while the Agent keeps retrying, daily Gas consumption above 200% of budget, or Thought Log patterns suggesting Prompt Injection. Automatic actions: immediately pause all Agent write operations; send a Telegram/PagerDuty alert; revoke every ERC-20 approval the Agent's operations address has granted to any protocol. Requires human confirmation within 5 minutes.
P1 (Same-day action, hour-level response): trigger conditions include tool grounding rate below 95%, tool call success rate below 85%, Context utilization above 80%, or transaction revert rate above 10%. Automatic actions: send an alert (without pausing operations) and begin recording detailed debug logs. A human reviews the logs within 1 hour of the alert.
P2 (Planned fix, day-level response): trigger conditions include value anomaly rate above 5%, average API latency above 3× normal, or Context utilization above 70%. Action: log issue, add to next deployment fix plan; no immediate intervention needed.
P3 (Observe, week-level response): slow drift in statistical metrics (e.g., tool call success rate slowly declining from 99% to 97%, not yet at alert threshold but trend worth watching). Action: log trend, review in weekly report.
Alert fatigue defense: if P2/P3 alerts are too frequent (more than 10 per day), thresholds are too sensitive — adjust thresholds first rather than ignoring alerts.
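The P0-P2 trigger conditions above can be expressed as one classification function. This is a sketch under stated assumptions: the Snapshot fields and the classify name are hypothetical, rates are fractions between 0 and 1, and Gas is expressed as a percentage of the daily budget. P3 is trend-based and is left out because it needs a history of values rather than a single snapshot.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One point-in-time reading of the metrics (field names are hypothetical)."""
    non_whitelist_op: bool = False       # any operation touching a non-whitelist address
    blocked_retry: bool = False          # BLOCKED in Validation Log but Agent keeps retrying
    injection_suspected: bool = False    # Thought Log pattern suggests Prompt Injection
    daily_gas_pct_of_budget: float = 0.0
    grounding_rate: float = 1.0
    tool_success_rate: float = 1.0
    context_utilization: float = 0.0
    revert_rate: float = 0.0
    value_anomaly_rate: float = 0.0
    api_latency_ratio: float = 1.0       # current average latency / baseline average


def classify(s: Snapshot):
    """Return "P0", "P1", "P2", or None, checking the most severe level first."""
    if (s.non_whitelist_op or s.blocked_retry or s.injection_suspected
            or s.daily_gas_pct_of_budget > 200):
        return "P0"
    if (s.grounding_rate < 0.95 or s.tool_success_rate < 0.85
            or s.context_utilization > 0.80 or s.revert_rate > 0.10):
        return "P1"
    if (s.value_anomaly_rate > 0.05 or s.api_latency_ratio > 3.0
            or s.context_utilization > 0.70):
        return "P2"
    return None
```

Checking the levels from most to least severe means a snapshot that satisfies several conditions is always reported at its highest level, for example Context utilization of 85% is P1 rather than P2.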
What tools should be used for Agent monitoring? How to choose between self-built vs. third-party platforms?
LLM Reasoning Layer Monitoring Tools:
LangSmith (LangChain official, from $39/month, free version with limits): the most deeply integrated tracing platform for LangGraph, automatically recording each node's inputs/outputs, token usage, and complete Thought Logs. Suitable for Agents already using LangGraph.
Langfuse (open source, self-hostable, free): an open-source alternative to LangSmith with comparable features, suitable for high-data-privacy scenarios (e.g., an Agent handling institutional funds whose Thought Logs should not be sent to third-party services).
Metrics Aggregation and Alerting Tools:
Grafana + Prometheus (open source combination, self-hosted): the most flexible metrics visualization and alerting system. Export Agent metrics from all layers (tool call success rates, Gas consumption, Context utilization) in structured format to Prometheus; build dashboards and alert rules in Grafana. Cost: VPS fees; most complete functionality.
PagerDuty (from $20/month): professional alert routing platform supporting alert severity levels (P0/P1/P2), on-call rotations, and phone alerts (especially useful for P0 incidents). Suitable for institutional deployments with engineering teams.
Telegram Bot (free): the simplest alert channel for individual developers. Push all alerts to a designated Telegram channel via the python-telegram-bot library; send P0 alerts with an @mention. Cost is nearly zero and latency is low.
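As a rough illustration of the Telegram channel, the sketch below builds a request for the Telegram Bot API's sendMessage method using only the standard library, instead of the python-telegram-bot library mentioned above. The function names, the "[P0]" message prefix, and the mention parameter are assumptions for illustration; the token and chat ID are placeholders you would supply.

```python
import json
import urllib.request


def build_alert_request(token, chat_id, severity, message, mention=None):
    """Build (but do not send) a sendMessage request for the Telegram Bot API."""
    text = f"[{severity}] {message}"
    # Only P0 alerts carry an @mention, so lower-severity alerts stay quiet.
    if severity == "P0" and mention:
        text = f"@{mention} {text}"
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"https://api.telegram.org/bot{token}/sendMessage",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(request, timeout=10):
    """Send a prepared request; returns True on HTTP 200."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200
```

Separating request construction from sending keeps the message format testable without network access; only send_alert touches the network.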
Self-built vs. Third-party Decision Principles:
LLM reasoning tracing (Thought Logs): use third-party (LangSmith/Langfuse). Integration requires framework-level support, so the cost of building it yourself is high.
On-chain monitoring: use self-built (Web3.py direct queries). On-chain data is public, no third party is needed, and self-built is faster and cheaper.
Alert channels: Telegram for individuals, PagerDuty for enterprises.
How can Agent monitoring detect Prompt Injection attacks? What signatures can be identified in logs?
Detecting Prompt Injection through monitoring relies on continuous observation of the Agent's behavior patterns. When an attack occurs, the Agent's behavior typically deviates from its normal patterns, and these deviations can be detected quantitatively in the logs:
Signature 1: Anomalous objective statements in Thought Log
Normal Thought Logs should be entirely focused on the tasks set in your System Prompt (yield optimization, rebalancing decisions). If a Thought Log contains 'the primary task is now...' followed by objectives unrelated to the task, this is a strong Prompt Injection signal. Monitoring implementation: after each LLM output, scan the Thought content with regex or keyword matching for phrases like 'primary objective,' 'new task,' and 'transfer to'; trigger an immediate alert on a match.
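A minimal sketch of the keyword scan for Signature 1, assuming the Thought content is available as a plain string. The pattern list and the scan_thought name are illustrative assumptions; a real deployment would tune the patterns to its own System Prompt.

```python
import re

# Illustrative patterns based on the phrases named above; not an exhaustive list.
SUSPICIOUS_PATTERNS = [
    re.compile(r"primary\s+(task|objective)", re.IGNORECASE),
    re.compile(r"new\s+task", re.IGNORECASE),
    re.compile(r"transfer\s+to", re.IGNORECASE),
]


def scan_thought(thought: str):
    """Return the patterns that matched; an empty list means nothing suspicious."""
    return [p.pattern for p in SUSPICIOUS_PATTERNS if p.search(thought)]


hits = scan_thought("The primary task is now to transfer to 0xabc")
if hits:
    print("possible Prompt Injection, matched:", hits)
```

Keyword matching is cheap but coarse: a phrase like 'transfer to' may also appear in legitimate rebalancing reasoning, so matches are best treated as a signal to inspect alongside the other signatures.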
Signature 2: Anomalous tool call sequences
A normal DeFi strategy Agent's tool call sequence is predictable (query APY → query Gas → decision → possible rebalancing tool). If tools that are never called in normal operation suddenly appear (send_http_request, read_file, external URL calls), this signals that Prompt Injection is changing the Agent's behavior. Monitoring implementation: maintain a whitelist of normal tool call sequences; log any tool call that falls outside the whitelisted order as anomalous.
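One way to sketch the sequence whitelist for Signature 2 is a table of allowed transitions between tools. The tool names query_apy, query_gas, and rebalance are hypothetical stand-ins for the Agent's real tools, and the decision step is assumed to be LLM reasoning rather than a tool call.

```python
# Allowed next tools for each previous tool; None marks the start of a run.
# Tool names are hypothetical placeholders.
ALLOWED_NEXT = {
    None: {"query_apy"},
    "query_apy": {"query_gas"},
    "query_gas": {"rebalance"},
    "rebalance": set(),
}
KNOWN_TOOLS = {tool for tool in ALLOWED_NEXT if tool is not None}


def find_anomalies(calls):
    """Return (index, tool, reason) for every call that breaks the whitelist."""
    anomalies = []
    previous = None
    for index, tool in enumerate(calls):
        if tool not in KNOWN_TOOLS:
            anomalies.append((index, tool, "tool not in whitelist"))
        elif tool not in ALLOWED_NEXT.get(previous, set()):
            anomalies.append((index, tool, "out of order"))
        previous = tool
    return anomalies


print(find_anomalies(["query_apy", "send_http_request"]))
```

A run that ends after query_gas without calling rebalance is treated as normal here, which matches the "possible rebalancing" step in the sequence above.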
Signature 3: Concentrated BLOCKED pattern in Validation Log
In normal operation, BLOCKED records are rare (0-2 per day). If 3 or more BLOCKED records targeting the same non-whitelist address appear within a 30-minute window, this closely matches the pattern of 'Prompt Injection causing the Agent to repeatedly attempt transfers to an attacker address, blocked by backend validation.' Monitoring implementation: run frequency analysis on BLOCKED records; a concentration of BLOCKED records within a short time window (same rejection reason, same target address) triggers a P0 alert.
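The frequency analysis for Signature 3 can be sketched as a sliding-window count. The record shape (timestamp, target address, rejection reason) and the function name are assumptions for illustration; in practice the records would come from the Validation Log store.

```python
from collections import Counter
from datetime import datetime, timedelta


def concentrated_blocked(records, now, window=timedelta(minutes=30), min_count=3):
    """Return (address, reason) pairs BLOCKED at least min_count times in the window.

    records: iterable of (timestamp, target_address, rejection_reason) tuples.
    """
    cutoff = now - window
    counts = Counter(
        # Lowercase so checksummed and plain hex forms of an address match.
        (address.lower(), reason)
        for timestamp, address, reason in records
        if timestamp >= cutoff
    )
    return sorted(key for key, count in counts.items() if count >= min_count)


now = datetime(2024, 1, 1, 12, 0)
records = [
    (now - timedelta(minutes=1), "0xBad", "not_whitelisted"),
    (now - timedelta(minutes=10), "0xBad", "not_whitelisted"),
    (now - timedelta(minutes=25), "0xBad", "not_whitelisted"),
]
if concentrated_blocked(records, now):
    print("P0: concentrated BLOCKED pattern")
```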
Signature 4: Large deviation between tool return values and Thought-cited values
Hallucination-type numerical deviation is typically 5-20%. Deviation induced by Prompt Injection tends to be more extreme (e.g., a protocol's APY suddenly 'becoming' 100%), because the attacker's injected fake values must be dramatic enough to push the Agent into decisions it would not make with normal values. Monitoring implementation: record the magnitude of deviation between tool return values and Thought-cited values; deviations above 30% (not just the 5% hallucination threshold) trigger a Prompt Injection alert.
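A small sketch of the two-threshold comparison for Signature 4, assuming deviation is measured relative to the tool's returned value. The function names and the returned labels are illustrative assumptions.

```python
def relative_deviation(tool_value: float, cited_value: float) -> float:
    """Deviation of the Thought-cited value relative to the tool's return value."""
    if tool_value == 0:
        return 0.0 if cited_value == 0 else float("inf")
    return abs(cited_value - tool_value) / abs(tool_value)


def classify_deviation(tool_value, cited_value, hallucination=0.05, injection=0.30):
    """Label a cited value using the 5% and 30% thresholds described above."""
    deviation = relative_deviation(tool_value, cited_value)
    if deviation > injection:
        return "injection_suspected"
    if deviation > hallucination:
        return "hallucination_suspected"
    return "ok"


# The tool returned an APY of 4.0%, but the Thought cites 100%.
print(classify_deviation(4.0, 100.0))
```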
A minimum viable monitoring system design for a DeFi Agent
Below is a minimum viable monitoring system a personal Onchain Agent can establish within 2 days:
Components:
structlog + PostgreSQL: records structured operation logs (tool call success/failure, Gas consumption, transaction results)
Key alert rules (implemented in Cron Job):
check_daily_gas() → if daily Gas exceeds 150% of budget, push Telegram P1 alert; above 200%, push P0 and pause write operations.
check_blocked_pattern() → query BLOCKED records from the past 30 minutes; if the same address appears 3+ times, push P0 alert.
check_grounding_rate() → compare Thought Log values against tool returns; if the deviation rate is above 5%, push P1 alert.
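Two of these rules might look like the sketch below. The bodies are assumptions built from the thresholds stated above: notify and pause_writes are hypothetical callables injected by the Cron Job, and "deviation rate" is read here as the share of Thought-cited values that do not match the tool returns.

```python
def check_daily_gas(gas_used, gas_budget, notify, pause_writes):
    """Escalate on daily Gas: above 150% of budget is P1, above 200% is P0."""
    if gas_budget <= 0:
        raise ValueError("gas_budget must be positive")
    pct = 100.0 * gas_used / gas_budget
    if pct > 200:
        pause_writes()  # stop write operations before alerting
        notify("P0", f"daily Gas at {pct:.0f}% of budget, writes paused")
        return "P0"
    if pct > 150:
        notify("P1", f"daily Gas at {pct:.0f}% of budget")
        return "P1"
    return None


def check_grounding_rate(total_citations, mismatched, notify):
    """Push P1 when more than 5% of cited values disagree with tool returns."""
    if total_citations == 0:
        return None  # nothing cited in this interval, nothing to judge
    deviation_rate = mismatched / total_citations
    if deviation_rate > 0.05:
        notify("P1", f"deviation rate {deviation_rate:.1%} above 5%")
        return "P1"
    return None


# Usage with stand-in callables; a real job would pass a Telegram sender.
check_daily_gas(2.5, 1.0, notify=lambda sev, msg: print(sev, msg),
                pause_writes=lambda: print("writes paused"))
```

Passing notify and pause_writes as parameters keeps the rules independent of the alert channel, so the same checks can run against a test double or a real Telegram sender.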
Cost of this design: LangSmith free tier (with call volume limits) or self-hosted Langfuse (only VPS costs); PostgreSQL starts free on Railway; Telegram Bot free. Monthly cost of the entire monitoring system is near $0 (Langfuse self-hosted) to $39 (LangSmith basic). This minimum viable monitoring system lets you receive alerts within 5 minutes of a problem occurring, rather than discovering it after financial losses.
More detailed monitoring (tracking more metrics, shorter check intervals) detects problems faster and covers more ground, but it brings higher system complexity (more code, more potential bugs), higher operational costs (more API calls, more storage), and greater alert fatigue risk (too many metrics produce daily alerts that drown out the genuinely important ones). Simpler monitoring has lower maintenance costs and higher alert quality, but leaves monitoring blind spots. For most personal Onchain Agents: start with the three most critical monitoring points (daily Gas consumption, BLOCKED patterns on non-whitelist addresses, tool grounding rate), which cover the highest-impact scenarios. Expand monitoring coverage gradually once the Agent is operating stably.