The moment you connect an MCP server, your coding agent stops being a thing that reads and writes in your repo and becomes a thing that can reach out and act. Read a database, hit an API, touch a service, pull in a web page. That’s the entire appeal. It’s also the entire problem, and the two are the same feature seen from two sides.
I went through this wiring up tools for my own plugin work, and the thing that saved me from a worse mistake was a scar I already had. I’d shipped an AI chatbot earlier where I rendered the model’s output straight to the page and ate an HTML injection bug. That one taught me a rule the hard way: anything an LLM hands back is untrusted input. MCP is that same lesson with the blast radius turned up, because now the untrusted thing isn’t just my model’s text, it’s whatever a connected server decides to return.
Two different fears, and people only have one
When people talk about agent safety, they almost always mean: what if the agent runs something destructive. Deletes files, force-pushes, curls something it shouldn’t. That fear is real and it has a real answer, which I’ll get to.
But it’s only half the threat, and it’s the visible half. The other half is quieter: what if the agent believes something it shouldn’t. An MCP server returns an API response, a database row, an issue body, an email thread, and somewhere in that returned content is a line that reads like an instruction. The model has no reliable way to tell your instruction from text that arrived inside data. To it, they’re the same channel. So the danger isn’t only the command the agent runs, it’s the command it gets talked into running by content that came back through a tool.
Those are two separate problems, and they need two separate defenses. Most setups install one and assume they’re covered.
The output side: treat every tool return as a form field
Here’s the rule I carry over from the chatbot bug. Whatever an MCP server returns is in exactly the same trust category as a string a stranger typed into a form on your site. Not “data from my tool.” Data from outside, that happens to arrive through my tool.
That reframe changes what you do with it. You don’t pass a tool’s raw output straight into the arguments of the next command. You don’t render it to a screen without escaping. And you don’t let an imperative sentence buried in a returned document quietly become something the agent treats as a directive. The output is a payload to inspect, not a value to trust, and the fact that it came from a server you connected doesn’t launder it. You connected the pipe. You didn’t vouch for everything that flows through it.
This is the half the tooling won’t do for you, because it can’t. No sandbox flag knows whether a returned string is a legitimate API result or a planted instruction. That judgment lives in how you wire the data, not in a setting.
The action side: walls, not manners
The other half, the destructive-command fear, does have a settings answer, and it’s worth getting exactly right because it’s easy to half-do.
Asking the model nicely not to run dangerous things is manners, and manners get bypassed the moment the input talks it into something. What you want is a wall. In Claude Code that’s the sandbox: turn it on and Bash execution gets isolated at the OS level, with the restriction inherited by every subprocess the agent spawns. Seatbelt on macOS, Bubblewrap on Linux and WSL2.
// .claude/settings.json
{
"sandbox": {
"enabled": true,
"allowUnsandboxedCommands": false
}
}
Here’s the part that surprised me, straight from the docs: the sandbox restricts writes to your working directory, but by default it still allows reads across most of the machine. Which means credentials like ~/.aws/credentials and ~/.ssh/ are readable unless you say otherwise. The sandbox is a wall around writing and executing, not around reading. Reading you close yourself, with deny rules:
{
"permissions": {
"deny": [
"Bash(rm -rf *)",
"Bash(curl *)",
"Bash(wget *)",
"Read(./.env)",
"Read(./.env.*)",
"Read(~/.ssh/**)",
"Read(~/.aws/**)"
]
}
}
This is the combination that actually matters, and it’s exactly where the two halves meet. If a tool return talks the agent into exfiltrating a secret, the read-deny is what stops it from reading the secret, and the network-command deny is what stops it from sending it. The output-side problem is what gets the agent to try; the action-side walls are what make the attempt fail. Neither covers for the other.
The third thing: don’t connect what you don’t trust, or don’t need
The server itself is attack surface. A connected MCP server you didn’t vet is code running in your loop with a channel into your agent. So three habits, none of them clever:
Vet the source before you connect it. Who wrote it, can you see the code. An MCP server is not a browser tab you can close and forget; while it’s connected, it’s part of your trust boundary.
Drop the ones you’re not using. Every idle server is attack surface and context weight at once, and /mcp will show you what’s actually connected versus what you wired up once and forgot.
And the one I reach for most: if a CLI already does the job, don’t stand up a server at all. I drive WordPress with wp-cli, and rather than wrap it in an MCP server I call it from a Skill when I need it. One less always-connected surface, one less thing returning content I’d have to distrust. “Connect a server” and “call a command” are not the same risk, and the second is often the smaller one.
What I actually check now
Before I wire up any MCP server, three questions, in this order. Do I trust where this server came from. What’s the worst command it could talk my agent into, and is that command walled off by deny rules. And is the output going to get treated as untrusted input everywhere it lands, or am I about to pass a stranger’s text straight into a sink.
The sandbox answers exactly one of those. The other two are on me, and they’re the ones that bit me first. Connecting a server is genuinely the moment your agent grows hands. Worth remembering that hands can be guided by whoever’s holding the other end of the tool.
I build WordPress plugins and write about AI tooling and security at https://raplsworks.com/.
