Deploying an MCP Server to Production: Auth, Rate Limits, Observability
A production checklist for shipping MCP servers — OAuth2 scopes, token rotation, rate limiting, structured logs, tracing, and real deploy patterns on Fly, Railway and Kubernetes.
Most MCP tutorials stop at `npx @modelcontextprotocol/create-server` and a localhost stdio transport. That is fine for a laptop demo. It is not what ships. This guide is the checklist I run through before putting an MCP server behind a public endpoint that real agents hit.
The three deployment modes, and when each makes sense
MCP is transport-agnostic. In practice you pick one of three modes:
- stdio — co-located, launched by the agent as a child process. Zero network attack surface, zero auth, lowest latency. Use for local-only tools (filesystem, shell).
- Streamable HTTP (the replacement for the deprecated SSE transport) — stateless or stateful HTTP, with long-polling or chunked responses. Use when multiple agents on multiple machines call one server, or when you want a centralised audit log.
- WebSocket — rare in practice; only useful if you need server-push unsolicited notifications with low framing overhead.
The remote modes are where everything in the rest of this guide applies.
Auth: do not roll your own
The current MCP auth spec references OAuth 2.1 with PKCE and explicit resource indicators (RFC 8707). This matters because MCP tokens travel as `Authorization: Bearer` headers and you will eventually mint per-agent scopes — so you want a token server, not a static shared secret.
A pragmatic stack that works today:
```
┌──────────┐    OAuth     ┌───────────┐   JWT (RS256)   ┌────────────┐
│  Agent   │ ───────────► │  Clerk /  │ ──────────────► │ MCP Server │
│ (Claude) │ ◄─────────── │  Auth0 /  │                 │            │
└──────────┘    Bearer    │   Ory     │                 └────────────┘
                          └───────────┘
```
Concrete requirements:
- Never accept static API keys except as a fallback for CI. They leak, they never rotate, and you cannot revoke individual agents.
- Verify the audience claim. If your JWT `aud` does not match your MCP server URL, reject. This blocks token-substitution attacks where a stolen token from an unrelated service is replayed against your tools.
- Scope tools, not servers. One MCP server may expose `search`, `write_issue`, and `delete_repo`. Your scope grammar should be able to say "read-only" and have the server enforce it at the `list_tools` level — do not rely on the agent to respect it.
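The two checks above can be sketched as plain functions. This is a sketch, not a full middleware: it assumes the token signature was already verified by a real JWT library, and the server URL, scope names, and tool names are hypothetical.

```typescript
// Claim checks that run AFTER signature verification (sketch; names hypothetical).

interface Claims {
  sub: string;            // agent id
  aud: string | string[]; // audience(s) the token was minted for
  scope: string;          // space-separated, e.g. "mcp:read mcp:write"
}

const SERVER_URL = "https://mcp.example.com"; // hypothetical

// Reject tokens minted for any other resource (blocks token substitution).
function audienceOk(claims: Claims): boolean {
  const aud = Array.isArray(claims.aud) ? claims.aud : [claims.aud];
  return aud.includes(SERVER_URL);
}

// Enforce read-only scopes at the tool-listing level, not in the agent.
const WRITE_TOOLS = new Set(["write_issue", "delete_repo"]);

function visibleTools(claims: Claims, allTools: string[]): string[] {
  const scopes = claims.scope.split(" ");
  if (scopes.includes("mcp:write")) return allTools;
  return allTools.filter((t) => !WRITE_TOOLS.has(t));
}
```

A read-only token then simply never sees the write tools in `list_tools`, so even a misbehaving agent cannot call them.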
Rate limiting: per-tool, per-agent, not per-IP
Agent traffic is not human traffic. A Claude session can legitimately fire forty tool calls in ten seconds during a coding task. Per-IP rate limits will either throttle real work or be set so high they are useless.
Rate limit on `(agent_id, tool_name)` tuples. The `agent_id` comes from the JWT `sub` claim. Typical budgets that I have found do not break real workflows:
- Read tools (`search`, `get`): 300 req/min per agent
- Write tools (`create`, `update`): 60 req/min per agent
- Destructive tools (`delete`, `drop`): 5 req/min per agent, with a 429 that includes the reset time in `Retry-After`
Return a proper 429 with a structured error payload. MCP clients increasingly respect `Retry-After`; Claude Code as of late 2026 does.
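A minimal sketch of the limiter, keyed on the `(agent_id, tool_name)` tuple with the budgets above. The in-process `Map` is for illustration only — a real deployment would back this with Redis so limits survive restarts and apply across replicas.

```typescript
// Fixed-window rate limiter per (agent, tool) pair (sketch; names hypothetical).

const BUDGETS = {
  read: 300,        // req/min
  write: 60,
  destructive: 5,
} as const;

type Window = { start: number; count: number };
const windows = new Map<string, Window>();

// Returns null when allowed, or seconds to wait (the Retry-After value)
// when the budget for this window is exhausted.
function checkLimit(
  agentId: string,
  tool: string,
  kind: keyof typeof BUDGETS,
  now: number = Date.now(),
): number | null {
  const key = `${agentId}:${tool}`;
  const w = windows.get(key);
  if (!w || now - w.start >= 60_000) {
    windows.set(key, { start: now, count: 1 }); // new one-minute window
    return null;
  }
  if (w.count < BUDGETS[kind]) {
    w.count++;
    return null;
  }
  return Math.ceil((w.start + 60_000 - now) / 1000); // seconds until reset
}
```

On a non-null return, respond 429, set the `Retry-After` header to that value, and include it in the structured error body so the agent can back off precisely.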
Observability: what to log, what not to log
Three signals matter: per-tool latency, per-tool error rate, and token cost attribution. The fourth signal people forget is prompt-echo detection — did the server return the request verbatim? That usually means a broken tool handler falling through to a default.
A minimal OTel span schema for an MCP tool call:
```
span.name = "mcp.tool.call"
attributes:
  mcp.tool.name   = "search_repos"
  mcp.agent.id    = "agent_a1b2c3"   # from JWT sub
  mcp.client.name = "claude-code"    # from client info
  mcp.client.ver  = "0.7.3"
  mcp.input.bytes  = 412
  mcp.output.bytes = 8912
  mcp.result = "ok" | "error" | "rate_limited"
  error.type = "upstream_timeout"    # on failure
```
What NOT to log in production:
- Tool arguments verbatim — they frequently contain PII from the agent's context window.
- Full tool output — same reason, plus cardinality explosion.
- JWTs, even hashed. Use the `sub` claim only.
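One way to enforce both rules at once is to build the span attributes from a single helper that structurally cannot leak payloads — it only ever computes byte sizes. The helper and its field names below are a sketch following the schema in this section; the `ToolCall` shape is hypothetical.

```typescript
// Build OTel span attributes from a tool-call record (sketch).
// Only sizes and the JWT sub are recorded — never arguments, output, or tokens.

interface ToolCall {
  tool: string;
  sub: string;          // from the verified JWT
  clientName: string;
  clientVersion: string;
  input: unknown;       // never logged verbatim
  output: unknown;      // never logged verbatim
  result: "ok" | "error" | "rate_limited";
  errorType?: string;
}

function spanAttributes(call: ToolCall): Record<string, string | number> {
  const attrs: Record<string, string | number> = {
    "mcp.tool.name": call.tool,
    "mcp.agent.id": call.sub,
    "mcp.client.name": call.clientName,
    "mcp.client.ver": call.clientVersion,
    // Sizes only: enough for cost attribution, no PII.
    "mcp.input.bytes": Buffer.byteLength(JSON.stringify(call.input)),
    "mcp.output.bytes": Buffer.byteLength(JSON.stringify(call.output)),
    "mcp.result": call.result,
  };
  if (call.errorType) attrs["error.type"] = call.errorType;
  return attrs;
}
```

Passing this record to your tracer's span is then the only logging path, so there is no handler that can accidentally attach raw arguments.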
Deploy targets, ranked
For a small MCP server (single container, stateless, <100 RPS):
- Fly.io — best cold-start, global anycast out of the box, per-machine budget knobs. Downside: the 256 MB free tier will OOM on a JSON-heavy tool. Use `shared-cpu-1x@1024` minimum.
- Railway — easiest DX, solid if you already use it. Cold starts are noticeably slower than Fly.
- Render — good free tier, but sleeps aggressively. Not ideal for agent workloads where the first call after sleep costs the agent retry budget.
For anything stateful or larger:
- Kubernetes — the usual boring choice. Two knobs that matter specifically for MCP: set `terminationGracePeriodSeconds: 60` so in-flight tool calls can finish on rollout, and use a `PodDisruptionBudget` with `maxUnavailable: 1` to avoid dropping all pods mid-session.
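As manifest fragments, those two knobs look like this — a sketch; the names and labels are hypothetical and should match your own Deployment:

```yaml
# Deployment fragment: let in-flight tool calls drain on rollout.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server              # hypothetical
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
---
# Never take down more than one pod during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mcp-server-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: mcp-server           # must match the Deployment's pod labels
```

The grace period only helps if your server actually handles SIGTERM by finishing open requests before exiting.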
Real numbers I measured
A Fly.io `shared-cpu-1x@1024` machine in region `cdg`, running a TypeScript MCP server with three tools (`search`, `fetch`, `summarise`), measured from a cold deploy:
- Cold start: 420 ms (Node 20 + ESM)
- P50 tool call latency: 38 ms
- P99 tool call latency: 180 ms (network-bound upstream)
- Memory RSS steady state: 94 MB
- Cost: ~$3/mo for 24/7 single machine, scaled to zero would be ~$0.40/mo
For the same workload on Railway I measured ~110 ms P50 and ~900 ms P99; the long tail is enough to matter when an agent is chaining ten calls.
Deploy checklist
- [ ] OAuth 2.1 with PKCE, not static keys
- [ ] JWT audience verification
- [ ] Per-tool-per-agent rate limits with correct 429 + `Retry-After`
- [ ] Structured logs, with no tool args or tool output in them
- [ ] OTel trace export to something (Honeycomb, Grafana Cloud, SigNoz)
- [ ] `terminationGracePeriodSeconds` >= 60
- [ ] Health check at `/healthz` that does not require auth
- [ ] CORS locked down; MCP clients do not need browser CORS
- [ ] Audit log of every tool call in a separate append-only store
Everything in this checklist is table stakes. If one item is missing, you will find out the hard way the first time an agent misbehaves.