A taxonomy of AI agent attacks: 19 categories, 268 rules
The threat surface we mapped building Aguara, and the failure modes that recur across real deployments.
- –AI agents are attacked through inference, not just input. The classic threat model doesn't fit.
- –MCP servers are the softest surface: every connected server is an implicit trust decision.
- –Across 268 rules in 19 categories, three failure modes dominate: over-broad permissions, unvalidated tool output, and prompt injection.
Most teams shipping AI agents today inherited a threat model built for software that does what it's told. Agents don't do what they're told. They do what they infer. That gap is the whole problem.
Why MCP changes the surface
The Model Context Protocol connects agents to tools, files, and other systems. It's the most useful thing to happen to agents in years, and also the place where the most damage can be done. Every server an agent trusts is a new entry point, and most are trusted implicitly.
An agent is only as safe as the least-reviewed server it's allowed to call.
What we measured
Across 268 rules in 19 categories, the same failure modes recur: over-broad permissions, unvalidated tool outputs, and prompt-injection paths that turn a helpful agent into a confused deputy. We tracked 58,000+ skills across the major registries to see how widespread these patterns really are.
$ aguara scan ./agent
▸ 19 categories · 268 rules
✗ 3 high · over-broad MCP scope
✗ 7 medium · unvalidated tool output
✓ 258 passed
How to think about it
Treat every external capability as untrusted until proven otherwise. Enforce policy at runtime, not just in review. And measure continuously. The ecosystem changes faster than any audit cycle.