MCP Observability: Tracing, Metrics, and Alerts That Actually Matter
A production-grade observability stack for MCP servers: OpenTelemetry traces, Prometheus metrics, structured logs with CID, and the 5 alerts every MCP operator needs.
Most MCP servers in the wild ship with console.log and a prayer. When they break in production, operators learn the hard way that grepping log lines is not an observability strategy. This guide is the exact stack we deploy at MCPizy and recommend for any MCP server handling more than 1k tool calls/day.
Three pillars, one purpose
Traces answer what happened. Metrics answer how often. Logs answer why. You need all three; one cannot replace the others.
- Traces — OpenTelemetry, one span per tool call, with inputs/outputs redacted to fields-of-interest only.
- Metrics — Prometheus histograms for latency, counters for errors by class, gauges for in-flight requests.
- Logs — structured JSON, one Correlation ID (CID) per client session, shipped to Loki or OpenSearch.
The 5 alerts every MCP server needs
- Error rate > 2% for 5 min — 5xx + tool-call failures combined. Page oncall.
- p95 tool latency > 2x baseline for 10 min — slow regression before users notice.
- Auth failure rate > 5% for 2 min — early signal of token rotation or brute-force.
- Upstream dependency down > 30s — if your MCP calls Tavily / Supabase / Stripe, their downtime is your downtime.
- Queue depth > 100 for 1 min — scale-out signal before 429s flow through.
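Under Prometheus + Alertmanager, the first two alerts might look like the sketch below. The metric names (`mcp_tool_calls_total`, `mcp_tool_duration_seconds`) and the `mcp_status` label values are assumptions matching the label scheme later in this guide — adapt them to your own instrumentation.

```yaml
groups:
  - name: mcp-alerts
    rules:
      # Alert 1: combined error rate above 2% for 5 minutes -> page.
      - alert: MCPErrorRateHigh
        expr: |
          sum(rate(mcp_tool_calls_total{mcp_status=~"error_.*|timeout"}[5m]))
            / sum(rate(mcp_tool_calls_total[5m])) > 0.02
        for: 5m
        labels:
          severity: page
      # Alert 2: p95 latency more than 2x the same window yesterday.
      - alert: MCPToolLatencyRegression
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(mcp_tool_duration_seconds_bucket[10m])))
            > 2 * histogram_quantile(0.95, sum by (le) (rate(mcp_tool_duration_seconds_bucket[10m] offset 1d)))
        for: 10m
        labels:
          severity: warn
```

Using `offset 1d` as the baseline is a simple choice; a recording rule over a longer window is more robust against yesterday also being bad.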
Tracing: span structure
```
tool_call (span)
├─ attributes: mcp.tool_name, mcp.client_id, mcp.session_id
├─ auth_check (child)
├─ argument_validation (child)
├─ upstream_call (child, per external dependency)
│  └─ attributes: http.method, http.status_code, http.duration
├─ post_processing (child)
└─ serialize_response (child)
```
Trace context propagates to downstream services via the W3C traceparent header; the CID rides alongside it in the baggage header, so every hop inherits both the trace and the correlation ID automatically.
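The OpenTelemetry SDK builds these headers for you; the sketch below (a hypothetical helper, not part of any SDK) only illustrates the wire format being propagated.

```typescript
// Sketch: the W3C trace-context headers an MCP server forwards to
// upstream dependencies. traceparent carries the trace context; the
// CID travels in baggage so every downstream hop can log it.
interface TraceContext {
  traceId: string; // 32 hex chars
  spanId: string;  // 16 hex chars
  cid: string;     // correlation ID for this client session
}

function propagationHeaders(ctx: TraceContext): Record<string, string> {
  return {
    // version 00, sampled flag 01
    traceparent: `00-${ctx.traceId}-${ctx.spanId}-01`,
    baggage: `mcp.cid=${encodeURIComponent(ctx.cid)}`,
  };
}
```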
Metrics: label discipline
The #1 rookie mistake in MCP observability is high-cardinality labels. Never label by user ID, session ID, or request ID — they explode metrics cardinality and your Prometheus bill.
Safe labels:
- mcp_tool_name (bounded, under 200 values)
- mcp_status (ok, error_client, error_server, timeout)
- upstream_dependency (bounded: the list of SaaS you call)
- http_method, http_status_code
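Keeping labels bounded means collapsing raw outcomes into a fixed taxonomy before they reach the metrics library. A minimal sketch of that mapping, using the mcp_status values above (the helper name is hypothetical):

```typescript
// Sketch: map a raw tool-call outcome onto the bounded mcp_status label
// set, so per-request values never leak into metric labels.
type McpStatus = "ok" | "error_client" | "error_server" | "timeout";

function toMcpStatus(httpStatus: number, timedOut: boolean): McpStatus {
  if (timedOut) return "timeout";          // deadline exceeded
  if (httpStatus >= 500) return "error_server"; // our fault
  if (httpStatus >= 400) return "error_client"; // caller's fault
  return "ok";
}
```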
Log format
```json
{
  "ts": "2026-04-24T12:00:00.000Z",
  "level": "info",
  "cid": "sess_abc123",
  "mcp_tool": "search_products",
  "mcp_client": "claude-desktop",
  "duration_ms": 142,
  "status": "ok",
  "user_id_hash": "h_xyz",
  "msg": "tool_call_complete"
}
```
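A log line like that can come from one small emitter function. The sketch below assumes a SHA-256 hash truncated to 8 hex chars for user_id_hash — the article only specifies that the ID is hashed, so the exact scheme is an assumption:

```typescript
// Sketch: emit one structured JSON log line per completed tool call.
// Field names mirror the example above; raw user IDs never appear.
import { createHash } from "node:crypto";

function logToolCall(f: {
  cid: string; tool: string; client: string;
  durationMs: number; status: string; userId: string;
}): string {
  return JSON.stringify({
    ts: new Date().toISOString(),
    level: "info",
    cid: f.cid,
    mcp_tool: f.tool,
    mcp_client: f.client,
    duration_ms: f.durationMs,
    status: f.status,
    // hash, never log, the raw user ID (assumed scheme: sha256, 8 hex chars)
    user_id_hash: "h_" + createHash("sha256").update(f.userId).digest("hex").slice(0, 8),
    msg: "tool_call_complete",
  });
}
```

One line per call, written to stdout, is enough for Loki or OpenSearch to ingest.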
Redaction rules
- Arguments over 1000 chars: truncate and indicate truncation
- Fields matching secret patterns (apikey, token, password, cookie): replace with [REDACTED]
- URLs with query strings: log the host + path only, drop the query
- Emails: hash (first 3 chars + hash of the rest)
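The four rules above fit in one value-redaction function. This is a sketch: the thresholds and patterns are the article's, while the helper name, the exact secret-key regex, and the hash scheme for emails are assumptions.

```typescript
// Sketch: apply the redaction rules to a single key/value pair
// before it is logged or attached to a span.
import { createHash } from "node:crypto";

const SECRET_KEYS = /(apikey|api_key|token|password|cookie|secret)/i;
const MAX_ARG_LEN = 1000;

function redactValue(key: string, value: string): string {
  // Rule 2: secret-looking field names are dropped entirely.
  if (SECRET_KEYS.test(key)) return "[REDACTED]";
  // Rule 4: emails keep their first 3 chars; the rest is hashed.
  if (/^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(value)) {
    const h = createHash("sha256").update(value.slice(3)).digest("hex").slice(0, 8);
    return value.slice(0, 3) + h;
  }
  // Rule 3: URLs with query strings lose the query.
  if (value.includes("?")) {
    try {
      const u = new URL(value);
      return u.origin + u.pathname;
    } catch { /* not a URL; fall through */ }
  }
  // Rule 1: oversized arguments are truncated, with a marker.
  if (value.length > MAX_ARG_LEN) {
    return value.slice(0, MAX_ARG_LEN) + "[truncated]";
  }
  return value;
}
```

The key-based check runs first on purpose: a secret field stays [REDACTED] even if its value happens to look like a URL or email.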
Reference stack
| Concern | Self-hosted | Managed |
|---|---|---|
| Traces | Grafana Tempo | Honeycomb, Datadog APM |
| Metrics | Prometheus + Grafana | Datadog, Chronosphere |
| Logs | Loki or OpenSearch | Datadog, New Relic |
| Alerting | Alertmanager + PagerDuty | Grafana OnCall, PagerDuty |
For MCPizy customers, all four concerns in the table above are built in from the Pro tier — the APC layer auto-instruments every tool call.
What "good" looks like
You should be able to answer these in under 30 seconds:
- Which tool is slowest right now, and by how much vs. yesterday?
- Who (which client) is hitting errors in the last hour?
- Is the recent spike caused by upstream (external) or us (internal)?
- What does a typical successful call look like vs a typical failing one?
If any of those takes more than 30 seconds, your stack is missing a piece.
Running MCP in production?
Centralised auth, cost analytics, and the APC optimization layer — free tier included.