Agento

Stop Building Swiss-Army-Knife Agents: Why 15 Specialists Beat 1 Generalist

Mar 2, 2026 · 14 min read
Greg Raileanu

Founder & CEO

26 years building and operating hosting infrastructure. Founded Remsys, a 60-person team that provided 24/7 server management to hosting providers and data centers worldwide. Built and ran dedicated server and VPS hosting companies. Agento applies that operational experience to AI agent hosting.


I did what everyone does when they first set up an AI agent. I gave it everything.

Social media analysis. Notion control. Google Workspace access. Figma integration. Video editing. CRM updates. If there was a skill or plugin available, I installed it. The logic seemed bulletproof: more capabilities equals more useful. Why limit your agent when you can make it do anything?

Then I watched it fall apart.

The more skills I added, the less reliable the agent became. It started picking the wrong tool for the job. It confused contexts. Its responses felt scattered, like talking to someone who's trying to listen to five conversations at once. The personality became jumbled. What was supposed to be a productivity powerhouse turned into an unreliable mess that I couldn't trust with anything important.

Sound familiar? If you've been building AI agents, you've probably hit this wall. And it turns out, there's hard science behind why it happens.

The Data Is Brutal

Let's start with the number that stopped me in my tracks.

Chroma Research published a study in mid-2025 called "Context Rot," analyzing 18 leading LLMs including GPT-4.1, Claude 4, Gemini 2.5, and Qwen 3. Their finding? Adding full conversation history (around 113,000 tokens) can drop accuracy by 30% compared to a focused 300-token prompt. And this happens well within the model's advertised token limit.

Read that again. The model isn't running out of space. It's running out of focus.

The researchers found that LLMs don't process context uniformly. Performance degrades as input grows, especially when relevant information gets buried in the middle of longer contexts. They called it "context rot," and it's the technical explanation for what every agent builder eventually discovers through painful experience: stuffing more into an agent's context doesn't make it smarter. It makes it dumber.

But it gets worse when you look specifically at tools.

A NeurIPS 2024 paper called TaskBench tested how LLMs handle increasing numbers of tools in a task chain. Single-tool tasks hit 96% accuracy. Bump that to 6 tools and you're at 39%. Eight tools? 25%. That's not a gradual decline. That's a cliff.

Another study from early 2026 found something even more precise. They identified a phase transition in skill selection accuracy. Below roughly 83 to 91 skills, the agent performs fine. Go above that threshold and accuracy doesn't just dip. It collapses. At 200 skills, selection accuracy drops to around 20%.

And here's the kicker: the primary failure mechanism isn't the raw number of skills. It's semantic confusability. When an agent has two or three tools that sound vaguely similar, it can't reliably pick the right one. Adding just two "competitor" skills per base skill causes 17 to 63% degradation. Your agent isn't overwhelmed by having too many tools. It's confused by having too many similar-sounding tools.
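You can catch this failure mode before it bites. Here's a minimal sketch of a confusability check over tool descriptions, using token-level Jaccard overlap as a crude stand-in for the embedding similarity a production system would use. The tool names and descriptions are hypothetical, and the 0.5 threshold is an illustrative guess, not a researched value.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two tool descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def flag_confusable(tools: dict[str, str], threshold: float = 0.5) -> list[tuple[str, str]]:
    """Return pairs of tools whose descriptions overlap enough to risk misrouting."""
    return [
        (n1, n2)
        for (n1, d1), (n2, d2) in combinations(tools.items(), 2)
        if jaccard(d1, d2) >= threshold
    ]

# Hypothetical skill registry for one agent
tools = {
    "post_tweet": "publish a short post to the social media timeline",
    "post_linkedin": "publish a short post to the social media feed",
    "fetch_analytics": "download engagement metrics for recent videos",
}
print(flag_confusable(tools))  # the two near-identical "publish" skills get flagged
```

Any flagged pair is a candidate for merging into one skill, or for splitting across two different agents.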

This perfectly matches what practitioners have found firsthand. The sweet spot? 7 to 10 skills per agent. Go beyond that and dependability tanks.

Why Narrow Agents Actually Work

So if one general-purpose agent falls apart under complexity, what's the alternative? The same thing that works in every other domain of human endeavor: specialization.

They have a reason to exist

The biggest problem with a general-purpose agent is that it's almost impossible to give it a clear purpose. When your agent can do everything, what is it for? You end up with something that has no north star, no optimization target, no way to evaluate whether it's doing well.

Narrow agents flip this completely. A YouTube agent's sole purpose is creating YouTube videos, optimizing for subscriptions, views, and conversions. Every skill it has (YouTube research, thumbnail generation, script writing, SEO optimization) serves that single goal. You can look at what it produces and immediately know: is this good or not?

That clarity of purpose is what lets narrow agents surprise you. They make suggestions you didn't expect because they're deeply focused on their domain. A general agent never develops that depth because it's spread across everything.

You can actually tell if they're working

With a narrow agent, performance is basically pass/fail. Did the social media agent write a good post? Did the research agent find relevant sources? Did the email agent draft an appropriate response? These are answerable questions.

Try evaluating a general agent. What does "good" even look like when the agent handles emails, manages your calendar, edits videos, and updates your CRM? You end up in a fog where the agent seems kind of useful but you can never quite pin down whether it's actually reliable.

They're trivially adaptable

Here's something that catches people off guard. Narrow agents are way easier to replicate and adapt than general ones.

A YouTube agent is structurally very similar to a TikTok agent or a Substack agent. Same pattern, different skills. You can duplicate it, swap out the platform-specific tools, adjust the optimization targets, and you're running in hours instead of weeks. Try doing that with a monolithic agent that has 40 skills tangled together.

This also makes sharing within teams straightforward. Hand someone a focused agent with 8 skills and a clear purpose, and they can understand it in an afternoon. Hand them a 50-skill general agent and they'll spend a week just figuring out what it's supposed to do.

Autonomous loops become possible

This is the one that really matters for anyone trying to build agents that run without babysitting.

Narrow agents enable simple, predictable automation loops. A "publish to social media" loop is reliable when it's powered by a dedicated social media agent that does one thing well. With a general agent, that same loop might decide to reorganize your Notion workspace instead of posting to Twitter because something in the context triggered the wrong skill.

Predictability is the foundation of autonomy. And predictability comes from focus.
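The shape of such a loop is almost boringly simple, which is the point. Here's a sketch with a stubbed-out publish step standing in for a real platform API; the `SocialMediaAgent` class and its method names are illustrative, not any particular framework's API.

```python
from dataclasses import dataclass

@dataclass
class SocialMediaAgent:
    """A narrow agent: its only job is turning drafts into published posts."""
    platform: str

    def publish(self, draft: str) -> str:
        # A real implementation would call the platform API; stubbed here.
        return f"[{self.platform}] {draft.strip()}"

def run_publish_loop(agent: SocialMediaAgent, queue: list[str]) -> list[str]:
    """Drain the queue through one dedicated agent.
    No skill routing happens, so there is no wrong-tool failure mode."""
    published = []
    while queue:
        published.append(agent.publish(queue.pop(0)))
    return published

agent = SocialMediaAgent(platform="twitter")
print(run_publish_loop(agent, ["Launch day!", "Thanks for 1k follows"]))
```

Because the loop can only ever do one thing, its failure modes are enumerable, which is what makes unattended operation sane.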

The Science Behind Specialization

The practical benefits are obvious once you see them. But the technical reasons run deeper than most people realize.

Parallel context windows solve context rot

Anthropic published a detailed breakdown of their multi-agent research system in 2025. The architecture is an orchestrator-worker pattern: a lead Claude Opus 4 agent coordinates specialized Claude Sonnet 4 subagents that work in parallel.

The result? A 90.2% performance improvement over a standalone Opus 4 model on internal research evaluations. And research time dropped by up to 90% for complex queries.

The key insight is about context windows. Each subagent operates in its own fresh context, exploring different aspects of a question simultaneously. Then it condenses the most important findings for the lead agent. Instead of one agent drowning in a massive, polluted context window, you get multiple agents each working with focused, relevant context.

As Anthropic put it: "Once intelligence reaches a threshold, multi-agent systems become a vital way to scale performance." That's not marketing copy. That's an engineering finding from deploying this at scale.
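The orchestrator-worker shape can be sketched in a few lines. This is not Anthropic's implementation, just an illustration of the pattern under obvious assumptions: the subagent function stands in for an LLM call, and condensing to a short summary is what keeps the lead agent's context clean.

```python
from concurrent.futures import ThreadPoolExecutor

def subagent(subtask: str) -> str:
    """Each worker gets a fresh, focused context: only its own subtask.
    A real system would call an LLM here and condense the findings."""
    return f"summary of {subtask}"

def orchestrate(question: str, subtasks: list[str]) -> str:
    """Lead agent fans subtasks out in parallel, then merges condensed results."""
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        findings = list(pool.map(subagent, subtasks))
    # The lead agent reasons over short summaries, never raw worker contexts.
    return f"{question}: " + "; ".join(findings)

result = orchestrate(
    "chip shortage impact",
    ["2021 automotive supply", "2025 fab capacity", "policy responses"],
)
print(result)
```

The design choice that matters is the condensation step: workers can burn as many tokens as they need, but only the distilled findings cross the boundary back to the orchestrator.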

Separating thinking from doing makes both better

This one is counterintuitive but the research is clear.

A paper on Multi-Small-Agent Reinforcement Learning (MSARL) tested what happens when you separate high-level reasoning from low-level tool execution into different specialized agents. A dedicated Reasoning Agent handles strategy and problem decomposition. Specialized Tool Agents handle the actual execution.

The result? A 1.5 billion parameter dual-agent system outperformed competing 7 billion parameter single-agent models. Let that sink in. An agent system that's roughly 5x smaller won because it split the cognitive workload into specialized roles.

The researchers found that coupling reasoning and tool execution within one model creates "cognitive interference." The model gets worse at reasoning when it also has to handle tool calls, and worse at tool selection when it also has to maintain a reasoning chain. Separate them and both improve.

Architecture beats raw size. That's the lesson.
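The reasoning/execution split looks like this in miniature. A sketch only, with stub functions where the MSARL paper uses trained models; the function names and the `(tool, argument)` plan format are my own illustrative choices.

```python
def reasoning_agent(goal: str) -> list[tuple[str, str]]:
    """Plans only: decomposes the goal into (tool, argument) steps.
    It never executes tools, so its context stays purely strategic."""
    return [("search", goal), ("summarize", goal)]

TOOL_AGENTS = {
    # Each tool agent executes exactly one tool and nothing else.
    "search": lambda arg: f"results for {arg}",
    "summarize": lambda arg: f"summary of {arg}",
}

def run(goal: str) -> list[str]:
    """Hand each planned step to its dedicated executor."""
    return [TOOL_AGENTS[tool](arg) for tool, arg in reasoning_agent(goal)]

print(run("GPU pricing trends"))
```

Neither side carries the other's cognitive load: the planner never sees tool-call syntax, and the executors never see the reasoning chain.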

The mixture of experts analogy is more than an analogy

Modern LLMs already use a technique called Mixture of Experts (MoE) internally. Models like Mixtral and (reportedly) GPT-4 don't activate every parameter for every token. They route each input to the most relevant subset of the network.

Multi-agent systems apply the exact same principle at the system level. Route each task to the right specialist. Don't waste capacity on irrelevant knowledge. The research confirms this: hierarchical routing recovers 37 to 40 percentage points of lost accuracy compared to flat tool selection.

Fault isolation saves you when things go wrong

When a single mega-agent makes one bad tool call, it can poison the entire context for every subsequent step. Errors compound. One wrong search result leads to a bad conclusion leads to a broken action leads to cascading failures.

With specialized agents, a failure in one agent doesn't touch the others. Your email agent having a bad day doesn't affect your research agent. This is the same principle that drove the shift from monolithic to microservice architectures in software engineering. Isolation isn't just nice to have. It's how you build reliable systems.

This Isn't a New Idea

In 1776, Adam Smith described a pin factory where ten specialized workers produced 48,000 pins per day. A single generalist working alone could make maybe 20. The ratio is absurd, but the principle is timeless: specialization plus coordination beats individual capability every time.

The Unix philosophy said it decades ago: do one thing and do it well. Software engineering learned this lesson with microservices. Management theory formalized it through transaction cost economics and role clarity research. And now AI agent design is learning it all over again.

Anthropic's research team discovered the organizational theory version of this firsthand. When they gave subagents vague instructions like "research the semiconductor shortage," the agents duplicated each other's work. One explored the 2021 automotive chip crisis while two others investigated 2025 supply chains. The fix was exactly what any good manager would do: give each specialist explicit, non-overlapping task boundaries with clear objectives and output formats.

The market seems to agree that this is where things are heading. Gartner predicts 40% of enterprise applications will embed AI agents by end of 2026, up from less than 5% in 2025. The multi-agent AI market is projected to surge from $7.8 billion to over $52 billion by 2030.

The Counterargument (And Why It's Not as Strong as It Sounds)

Fair is fair. Not everyone agrees.

Cognition AI, the team behind Devin, published a blog post titled "Don't Build Multi-Agents" arguing for single-threaded linear agents. Their core point: every agent action should be informed by the full context of all relevant decisions. When subagents work in parallel without visibility into each other's work, they make incompatible assumptions.

They illustrate this with a Flappy Bird example. Split "build a Flappy Bird clone" into subtasks for parallel agents, and one creates a Mario-style background while another builds a Flappy Bird character. Conflicting design decisions, incompatible results.

It's a valid concern. For coding tasks where multiple agents need to modify the same codebase, shared context genuinely matters. If you're building a single, tightly-coupled artifact, a single agent with full context can sometimes be the right call.

But here's the thing: most real-world agent use cases aren't about building one artifact. They're about managing independent workflows. Your YouTube production pipeline doesn't need to share context with your CRM updates. Your email drafting agent has nothing to do with your social media scheduling agent. These are naturally separate domains where isolated context is a feature, not a bug.

The research confirms this nuance. For small, semantically distinct skill sets under roughly 85 skills, a single well-designed agent can match multi-agent performance while using 54% fewer tokens and achieving 50% lower latency. The multi-agent advantage kicks in when task complexity, tool count, or semantic overlap exceeds single-agent capacity. In production environments with real business workflows, that threshold gets crossed fast.

The answer isn't dogmatic. It's practical: start with a focused single agent per domain, then orchestrate across domains.

The Playbook: How to Structure Your Agent Team

If you're convinced (or at least curious enough to try), here's the practical approach.

Enforce the 7-10 skill rule

Every time you want to add skill number 11, ask yourself: should this be a new agent instead? The answer is almost always yes. The research backs this up with hard numbers, but you'll feel it in practice too. The moment your agent starts picking the wrong tool for obvious tasks, you've crossed the line.

Give each agent one clear intent

Write it down in one sentence. "This agent creates YouTube content optimized for subscriber growth." If you can't write that sentence, your agent is too broad. This isn't just organizational advice. A clear intent shapes the agent's system prompt, constrains its decision-making, and makes evaluation straightforward.

Start narrow, stay narrow

Resist the urge to add "just one more skill." That impulse is how you end up back at the Swiss-army-knife problem. If a new capability doesn't directly serve the agent's stated intent, it belongs on a different agent. Scope creep kills agents just like it kills software projects.

Pick a platform that supports specialization

Persistent memory, extensible skills, and built-in messaging (Slack, Telegram, Discord) matter more than raw capability. Ephemeral cloud-VM approaches that spin up a fresh computer for each task work for one-off queries but aren't built for the kind of long-running, personality-rich agents that specialization demands. You need agents that remember what they've learned and stay in character across interactions.

Plan for inter-agent communication

This is the unsolved frontier. How do your agents share relevant findings without polluting each other's contexts? How do they coordinate without creating the very coupling you're trying to avoid? The teams that figure this out first will have a massive advantage.

The Future Is a Team, Not a Hero

If 2025 was the year we proved AI agents could work, 2026 is the year we figure out how to make them work together.

The path forward isn't building one super-agent that can do everything. It's building a team of specialists that each do their one thing exceptionally well, coordinated by a thin orchestration layer that routes tasks to the right expert.

I've seen builders planning to run 15 or more specialized agents to manage entire business divisions. Not because any single one of them is particularly brilliant. But because 15 focused agents, each with a clear purpose and a manageable skill set, will collectively accomplish what no single agent could.

The best agent isn't the one that can do everything. It's the one that does its one thing so well you forget it's AI.

Further Reading