Phase 8 · Deploy MCP · 8 steps

/health endpoint, liveness vs readiness, the 25h-down lesson

What goes in /health, what stays out, why a /health without a DB round-trip is treacherous, and the one-liner that prevents your container from looking healthy while the DB is down.

/health is the simplest endpoint in your server and the one with the highest blast radius. Get it wrong and your monitoring lies to you. We had a SaaS report (healthy) for 25 hours while it was completely cut off from its database. This recipe covers the right shape, the right depth, and the right things to leave out.

Step 1: Liveness vs readiness, pick both, separately

Two distinct questions:

| Question | Endpoint | What it checks |
|---|---|---|
| Liveness, "is the process alive?" | /health (or /livez) | The process responds to HTTP. That's it. |
| Readiness, "can it serve traffic?" | /ready (or /readyz) | DB round-trip + critical dependencies reachable |

For most MCP servers, both checks live at /health because the container orchestrator only checks one URL. The trap: if /health only checks "process responds", the container reports healthy while the DB is unreachable.
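
When you can configure two probes separately, the split is a few lines. A minimal sketch with node:http (the route names and the db import mirror this recipe's conventions, not a prescribed layout):

import { createServer } from 'node:http';
import { db } from './lib/db.js';

const server = createServer(async (req, res) => {
  if (req.url === '/livez') {
    // Liveness: the process answers HTTP. Nothing else.
    res.writeHead(200).end('ok');
    return;
  }
  if (req.url === '/readyz') {
    // Readiness: one cheap DB round-trip decides 200 vs 503.
    try {
      await db.query('SELECT 1');
      res.writeHead(200).end('ready');
    } catch {
      res.writeHead(503).end('not ready');
    }
    return;
  }
  res.writeHead(404).end();
});

server.listen(Number(process.env.PORT ?? 3000));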

Step 2: The 25h-down story

Real incident. The container showed (healthy) for 25 hours. /health returned 200 instantly. Users couldn't sign in the entire time.

What happened:

  • We migrated the DB from one Postgres host to another.
  • New Postgres host had only IPv6, the container only IPv4. Every DB call failed with ENOTFOUND.
  • /health didn't touch the DB. It returned {"status":"ok"} based on whether the HTTP server could respond.
  • Docker healthcheck: green. Monitoring dashboards: green. Users: locked out.
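
For contrast, the lying /health looked roughly like this (a reconstruction, not the actual incident code):

// Never touches the DB, so ENOTFOUND is invisible to it.
export function handleHealth(req, res) {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ status: 'ok' }));
}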

The fix is one DB round-trip in /health:

Step 3: The right /health shape

import type { IncomingMessage, ServerResponse } from 'node:http';
import { db } from './lib/db.js';

export async function handleHealth(req: IncomingMessage, res: ServerResponse) {
  const checks: Record<string, { ok: boolean; latencyMs?: number; error?: string }> = {};
  const start = Date.now();

  // Check 1: DB round-trip
  try {
    const t0 = Date.now();
    await db.query('SELECT 1');
    checks.db = { ok: true, latencyMs: Date.now() - t0 };
  } catch (err) {
    checks.db = { ok: false, error: err instanceof Error ? err.message : String(err) };
  }

  // Check 2 (optional): critical third-party API. Stripe, etc.
  // Skip for now, adds latency, can fail for reasons unrelated to your service.

  const allOk = Object.values(checks).every((c) => c.ok);
  const status = allOk ? 200 : 503;

  res.writeHead(status, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({
    status: allOk ? 'ok' : 'degraded',
    server: process.env.SERVER_NAME ?? 'unknown',
    version: process.env.SERVER_VERSION ?? '0.0.0',
    uptimeSec: Math.floor(process.uptime()),
    totalLatencyMs: Date.now() - start,
    checks,
  }));
}

What this gives you:

  • SELECT 1 is the cheapest possible DB query (~1ms when healthy). The round-trip is what catches connectivity failures.
  • status: 503 when something is wrong. Cloud Run + Docker treat 503 as unhealthy.
  • Latency per check, early signal of degradation.
  • Per-check status, when something's broken, you know which thing.
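
Wiring the handler into a bare node:http server looks roughly like this (the './health.js' path is an assumption; adapt to your framework's routing):

import { createServer } from 'node:http';
import { handleHealth } from './health.js';

const server = createServer((req, res) => {
  if (req.method === 'GET' && req.url === '/health') {
    // handleHealth is async; surface unexpected errors as 500 instead of hanging.
    handleHealth(req, res).catch(() => {
      if (!res.headersSent) res.writeHead(500);
      res.end();
    });
    return;
  }
  // ... your MCP routes
  res.writeHead(404).end();
});

server.listen(Number(process.env.PORT ?? 3000));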

Step 4: What to keep OUT of /health

Three categories:

1. Per-tenant data. Don't put tenant counts, last-signup time, anything that grows with users. /health is hit hundreds of times per day by monitoring; you don't need it generating DB load.

2. Heavy queries. /health should be < 100ms. No COUNT(*). No EXPLAIN. Just SELECT 1 or equivalent, ideally with a timeout guard (see the sketch after this list).

3. Secrets / internal IDs. /health is unauthenticated and often public. Don't leak anything you wouldn't put in a status page.
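
One refinement worth considering so /health itself can't hang when the DB does: cap the round-trip with a deadline. A hedged sketch (the 2s budget is an arbitrary choice, not from this recipe):

import { db } from './lib/db.js';

// Race SELECT 1 against a deadline so a wedged connection pool turns into
// a clean { ok: false } instead of a healthcheck that hangs until Docker's timeout.
async function dbCheck(timeoutMs = 2000): Promise<{ ok: boolean; error?: string }> {
  const deadline = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error(`db check timed out after ${timeoutMs}ms`)), timeoutMs),
  );
  try {
    await Promise.race([db.query('SELECT 1'), deadline]);
    return { ok: true };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}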

Step 5: Docker healthcheck wired to it

The cleanest setup is a tiny healthcheck.js next to your server, not an inline node -e string:

// healthcheck.js
const port = process.env.PORT ?? '3000';
fetch(`http://localhost:${port}/health`)
  .then((r) => process.exit(r.ok ? 0 : 1))
  .catch(() => process.exit(1));

And in the Dockerfile:

COPY healthcheck.js /app/healthcheck.js

HEALTHCHECK --interval=30s --timeout=5s --start-period=20s --retries=3 \
  CMD node /app/healthcheck.js

Why a file beats inline node -e:

  • Easier to test locally, node healthcheck.js works on your laptop too.
  • No shell-quoting hell, inline node -e "..." plus single/double quotes plus process.env interpolation is a debugging nightmare when it breaks. A real file is just JavaScript.
  • The .catch() matters: without it, a connection-refused error inside fetch becomes an unhandled rejection, and depending on the Node version the check either crashes with a noisy stack trace or hangs until Docker's timeout, instead of failing cleanly with exit code 1.

If you really want to keep it inline (single-file Dockerfile setups, demos), the version below works but be very careful with the catch:

HEALTHCHECK --interval=30s --timeout=5s --start-period=20s --retries=3 \
  CMD node -e "fetch('http://localhost:'+(process.env.PORT||'3000')+'/health').then(r=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"

Settings explained:

  • interval=30s, check every 30s. Cheap because /health is cheap.
  • timeout=5s, /health should be sub-100ms; 5s is generous for catching hangs.
  • start-period=20s, give the app 20s to boot before checks count against it.
  • retries=3, 3 consecutive failures before reporting unhealthy. Avoids flapping on transient DB hiccups.

docker ps then shows (healthy) / (unhealthy) / (starting). Your monitor reads that state, not the bare URL.
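
To read that state programmatically (deploy gates, scripts), docker inspect exposes it directly:

# Current health state: healthy | unhealthy | starting
docker inspect --format '{{.State.Health.Status}}' your-mcp

# Last few probe results, including exit code and output of healthcheck.js
docker inspect --format '{{json .State.Health.Log}}' your-mcp | jq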

Step 6: Cloud Run / MCPize wires it for you

If you deploy via Cloud Run or MCPize, /health is hit automatically as the HTTP probe target. You don't need to configure anything in the platform: expose /health, return 200 when ready, and it'll be picked up.

For custom Cloud Run config:

livenessProbe:
  httpGet: { path: /health, port: 8080 }
  initialDelaySeconds: 20
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
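
Cloud Run also supports a startup probe, the counterpart of Docker's start-period; a sketch reusing the same path (the thresholds are assumptions, tune them to your boot time):

startupProbe:
  httpGet: { path: /health, port: 8080 }
  periodSeconds: 2
  timeoutSeconds: 2
  failureThreshold: 10  # up to ~20s of boot before the revision counts as failed
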
Client check · run it on your machine
(grep -RnE "[\x22\x27\x60]/health[\x22\x27\x60]" src/saas/server.ts src/server.ts src/http.ts 2>/dev/null | head -3) || echo "no /health route in source"
Expected: A `/health` route literal appears in src/saas/server.ts or src/server.ts.
If you're stuck: Add a simple `GET /health` returning `{ status: "ok", version: "..." }` so Docker / Cloud Run can probe liveness.

Step 7: External monitoring (Uptime Kuma / Better Stack)

Container healthcheck only sees the container. External monitoring sees the URL, catches DNS failures, TLS expiry, CDN issues that internal checks miss.

Uptime Kuma is the easy self-hosted option. Add a monitor:

  • Type: HTTP(s)
  • URL: https://your-mcp.io/health
  • Interval: 60s
  • Notify: Telegram / Slack / email

When the URL fails 3 consecutive times, you get paged.
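
If you don't run Kuma yet, the standard self-hosted setup is a single container (the official image; port mapping and volume name are your choice):

docker run -d --restart=always \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  --name uptime-kuma louislam/uptime-kuma:1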

Step 8: Verify

Run academy_validate_step. The validator hits /health and confirms 200.

Manual end-to-end:

# 1. Healthy state
curl -s https://your-mcp.io/health | jq
# {
#   "status": "ok",
#   "server": "your-mcp",
#   "version": "1.0.0",
#   "uptimeSec": 3600,
#   "totalLatencyMs": 8,
#   "checks": { "db": { "ok": true, "latencyMs": 4 } }
# }

# 2. Simulate DB failure (don't do in prod!)
# Stop the DB / break the connection string, then hit /health
# → 503 Service Unavailable
# → "status": "degraded", "checks": { "db": { "ok": false, "error": "..." } }

# 3. Confirm Docker healthcheck reflects it
docker ps --filter name=your-mcp
# → STATUS: Up 2 hours (unhealthy)

The third check is the one that matters. If /health returns 503 but Docker still says (healthy), your HEALTHCHECK command isn't actually checking the HTTP status code; fix the Dockerfile.
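
You can also exercise the exit-code contract of healthcheck.js directly, the same way Docker does:

# Against a healthy server prints 0, against a stopped one prints 1
PORT=3000 node healthcheck.js; echo "exit code: $?"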

Common traps

  • /health doesn't touch the DB, covered in Step 2 (the 25h-down story).
  • /health returns 200 even on errors. Cloud Run + Docker can't distinguish. Always 503 on degraded.
  • /health does an expensive query, adds load on every check, can take down the DB during outages.
  • /health requires auth. Docker / Cloud Run can't authenticate. Always public.
  • No external monitoring, internal healthcheck doesn't see DNS / TLS / CDN failures.
  • Single point of monitoring, if your one monitoring tool fails, you don't know your site is down. Two independent monitors (Kuma + Better Stack) cover each other.

What good looks like

/health returns < 50ms when healthy, 503 when DB is unreachable, JSON shape with { status, server, version, uptimeSec, checks }. Docker healthcheck wired. External monitor (Kuma) checks every 60s and pages on 3 failures. When something breaks, you know within 3 minutes, not 25 hours.

If your container has reported (healthy) for an unbroken stretch through a Postgres incident, your /health is lying. Fix it before you migrate the next thing.
