Incident response

Use this when anything customer-affecting happens: an outage, a data
integrity breach, a suspected security event, accidental data
exposure, or an integration failure cascading across multiple customers.

| Last drill | Elapsed | Next drill |
| --- | --- | --- |
| Not yet executed | n/a | First Friday after MVP private beta |

Severity matrix

| Severity | Examples | First-response SLA |
| --- | --- | --- |
| P1 | Service down for >5% of customers; data integrity event; suspected breach; cryptographic chain broken | 15 minutes |
| P2 | Service down for one customer; extraction-layer degradation (docling/engine); cascading integration failures (multi-customer) | 1 hour |
| P3 | Single-customer integration failure; non-blocking degradation | 4 business hours |
| P4 | Cosmetic / informational | Next business day |
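
The matrix above can also be read as a lookup table. The following is a minimal Python sketch, assuming each page carries a timezone-aware timestamp. `FIRST_RESPONSE_SLA` and `ack_deadline` are hypothetical names used for illustration and are not part of any existing tooling. P3 and P4 are stated in business time, so the sketch leaves them as labels rather than guessing at a business calendar.

```python
from datetime import datetime, timedelta, timezone

# First-response SLAs copied from the severity matrix.
# P3 and P4 are business-time SLAs, so they stay as labels here
# (converting them would require a business calendar this sketch does not have).
FIRST_RESPONSE_SLA = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": "4 business hours",
    "P4": "next business day",
}


def ack_deadline(severity, paged_at):
    """Return the wall-clock acknowledgement deadline for P1/P2.

    Returns None for severities whose SLA is expressed in business time.
    Raises KeyError for an unknown severity.
    """
    sla = FIRST_RESPONSE_SLA[severity]
    if isinstance(sla, timedelta):
        return paged_at + sla
    return None


if __name__ == "__main__":
    paged = datetime(2025, 3, 3, 12, 0, tzinfo=timezone.utc)
    print("P1 ack due:", ack_deadline("P1", paged))
    print("P2 ack due:", ack_deadline("P2", paged))
```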

On-call

Sprint 0: the founder is the sole on-call. Once paid GA lands, contractor backup is engaged for nights and weekends.

Pager: BetterStack. Backup channel: SMS to the founder's mobile. Both are configured in BetterStack on the first day of paid beta.

P1 response checklist (run in order)

1. Acknowledge the page within 15 minutes. Reply in the #incident channel: "Acked, investigating, ETA next update 15 min."
2. Open an incident channel: #incident-YYYY-MM-DD-<short-name>. Pin the runbook URL.
3. Stop the bleeding. If the cause is obvious and reversible (e.g., bad deploy), roll back. Otherwise see step 4.
4. Triage:
   - Cloudflare dashboard → check Worker error rate and queue depth.
   - Fly dashboard → check muntin-docling-ephemeral-iad and muntin-extract-iad machine status, recent deploys.
   - Fly.io status page → check if extraction is failing; both muntin-docling-ephemeral-iad and muntin-extract-iad ride Fly.
   - Postgres (Neon) → check connection count, recent slow queries.
   - R2 dashboard → check storage class + recent operations.
   - Customer-facing status board: https://status.muntin.digital (rendered from infra/status-page/components.yaml; deploy notes in infra/status-page/deploy.md).

Worker 503 → status URL (H-sre-3 follow-up). When the Worker eventually grows a global 503 handler (out of scope for the H-sre-3 status-page batch), it MUST link customers to https://status.muntin.digital -- either via a Link: <https://status.muntin.digital>; rel="status" header on every 5xx response or inline in the JSON body (status_url: "https://status.muntin.digital"). Until that handler lands, the email-template footer + the in-product error boundary carry the link; the runbook's "customer notification" step below covers the manual path.
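
To pin down the response shape described in the note above, here is a small Python sketch. It is not Worker code; whatever language the Worker is written in, the eventual handler would need to produce an equivalent header or body field. `error_response` is a hypothetical helper, and the `error` field is an assumption. The note requires only one of the two link mechanisms; this sketch emits both so either consumer path works.

```python
import json

STATUS_URL = "https://status.muntin.digital"


def error_response(status, message):
    """Build (status, headers, body) for a 5xx response that links to the status page.

    The link is carried twice: as a Link header with rel="status" and as a
    status_url field in the JSON body. The "error" field name is an assumption.
    """
    if not 500 <= status <= 599:
        raise ValueError("only 5xx responses carry the status link")
    headers = {
        "Content-Type": "application/json",
        "Link": f'<{STATUS_URL}>; rel="status"',
    }
    body = json.dumps({"error": message, "status_url": STATUS_URL})
    return status, headers, body


if __name__ == "__main__":
    status, headers, body = error_response(503, "service unavailable")
    print(status)
    print(headers["Link"])
    print(body)
```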

5. Customer notification: if the event affects a customer, post to the status page and email affected customers within the following SLAs:
   - Status-page incident-create: when the BetterStack monitor-fire payload lands (page, SMS, or #alerts channel post), the on-call MUST open a corresponding status-page incident immediately; do not wait for triage to finish. Copy the monitor's affected-component(s) field into the status-page component selector, set state to investigating, and use the alert payload's summary line as the first update. The status-page incident is the customer-facing ledger of every monitor fire; if it lags the page, customers find out from social media first.
   - GDPR Personal Data Breach: 72 hours from becoming aware (DPA Section 10). "Becoming aware" means internal confirmation of a breach, not a customer report and not a guess. Document the timestamp at which awareness is established.
   - Service unavailability: post an update every 30 minutes during the incident, plus a final post-mortem within 5 business days.
6. Recovery: confirm health checks are green, the queue is drained, and the error rate is back at baseline. Wait 15 minutes before declaring resolved.
7. Post-mortem: within 5 business days. Blameless. Includes timeline, root cause, contributing factors, action items, and ownership.
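
The clocks in the checklist above (72-hour GDPR window, 30-minute update cadence, 15-minute recovery soak, 5-business-day post-mortem) are easy to miscount under pressure. The following Python sketch shows the arithmetic, assuming timezone-aware timestamps. All function names are hypothetical, and the business-day helper counts Monday to Friday only; it knows nothing about public holidays.

```python
from datetime import datetime, timedelta, timezone

GDPR_BREACH_WINDOW = timedelta(hours=72)   # from the documented awareness timestamp
UPDATE_INTERVAL = timedelta(minutes=30)    # status-page cadence during an outage
RECOVERY_SOAK = timedelta(minutes=15)      # wait before declaring resolved


def gdpr_notification_deadline(aware_at):
    """Latest time to notify, counted from internal confirmation of the breach."""
    return aware_at + GDPR_BREACH_WINDOW


def next_status_update_due(last_update_at):
    """When the next status-page update is due during an ongoing incident."""
    return last_update_at + UPDATE_INTERVAL


def ready_to_resolve(healthy_since, now):
    """True once the system has stayed healthy for the full soak period."""
    return now - healthy_since >= RECOVERY_SOAK


def add_business_days(start, days):
    """Advance by N business days (Mon-Fri). Holidays are NOT handled."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


if __name__ == "__main__":
    aware = datetime(2025, 3, 3, 9, 30, tzinfo=timezone.utc)
    print("GDPR deadline:", gdpr_notification_deadline(aware))
    print("Post-mortem due:", add_business_days(aware, 5))
```

Recording the awareness timestamp in the incident channel as soon as it is established gives these calculations a fixed starting point.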

P1 examples and shortcuts

Audit chain broken (verify endpoint returns valid: false)

A customer's GET /v1/audit/verify returns valid: false. This means the hash chain has been tampered with or a row was modified outside the append path. This is always a P1.

1. Pull the affected events: `wrangler d1 execute muntin-api --command "SELECT id, ts, action FROM audit_events WHERE org_id = '<id>' ORDER BY ts ASC"`. _(Binding name muntin-api matches apps/api/wrangler.toml's database_name. If you see a different binding in wrangler.toml, use that.)_
2. The break point is the brokenAt event id; investigate every row from there onward.
3. Restore the affected rows from the WORM mirror in S3 Object Lock (Phase-4 capability; until then, escalate directly to the founder).
4. Notify the customer within 24 hours regardless of whether the tampering was malicious. The audit chain is the privacy commitment; if it breaks, the customer hears about it.
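
For orientation, this is roughly how a hash-chain check arrives at a brokenAt id. The runbook does not specify the real hashing scheme, so this Python sketch makes several assumptions: SHA-256 over the previous hash plus a canonical JSON encoding of id, ts, and action; per-row `prev_hash` and `hash` fields; and an all-zero genesis hash. All of these are hypothetical and may not match the actual audit_events schema.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first row's prev_hash


def event_hash(prev_hash, event):
    """Hash the previous link together with a canonical encoding of the row."""
    payload = json.dumps(
        {"id": event["id"], "ts": event["ts"], "action": event["action"]},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(chain, row):
    """Append path: link the new row to the current tail of the chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    event = dict(row, prev_hash=prev)
    event["hash"] = event_hash(prev, event)
    chain.append(event)
    return event


def verify_chain(events):
    """Walk the chain in order; report the first row whose link or hash is wrong."""
    prev = GENESIS
    for event in events:
        if event["prev_hash"] != prev or event["hash"] != event_hash(prev, event):
            return {"valid": False, "brokenAt": event["id"]}
        prev = event["hash"]
    return {"valid": True, "brokenAt": None}


if __name__ == "__main__":
    chain = []
    for i, action in enumerate(["login", "export", "delete"], start=1):
        append_event(chain, {"id": f"evt_{i}", "ts": i, "action": action})
    print(verify_chain(chain))
    chain[1]["action"] = "noop"  # simulate a row modified outside the append path
    print(verify_chain(chain))
```

Under this scheme a single modified row invalidates itself and, through the stored links, everything appended after it, which is why step 2 says to investigate every row from brokenAt onward.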

docling worker pool unhealthy

1. Symptom: queue depth >100 sustained for >5 minutes. Check the logs: `flyctl logs --app muntin-docling-ephemeral-iad`.
2. Scale machine count: `flyctl scale count 6 --app muntin-docling-ephemeral-iad`.
3. Check whether a single document is wedging a worker (90+ second processing): the page-event log will show the wedge.
4. If a worker is wedged: kill the affected machine (`flyctl machine destroy <id>`); the job goes to the DLQ for manual replay later.

Extract service 502s sustained

Most likely a docling Machine spawn failure or the deterministic engine hitting a fallback ladder. See extraction-degradation.md.
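
Returning to the docling queue-depth trigger described earlier in this section (depth >100 sustained for >5 minutes): "sustained" means the depth never dropped back to the threshold during the window, not merely that it was high at both ends. A minimal Python sketch of that check follows, assuming queue-depth samples are available as (timestamp, depth) pairs in ascending time order. `sustained_breach` is a hypothetical helper, not an existing monitor.

```python
from datetime import datetime, timedelta, timezone


def sustained_breach(samples, threshold=100, window=timedelta(minutes=5)):
    """Return True if depth has stayed above threshold for longer than window.

    samples: list of (timestamp, depth) tuples sorted by ascending timestamp.
    Any sample at or below the threshold resets the clock.
    """
    breach_start = None
    for ts, depth in samples:
        if depth > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach run
        else:
            breach_start = None  # dipped back under; start over
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start > window


if __name__ == "__main__":
    t0 = datetime(2025, 3, 3, 12, 0, tzinfo=timezone.utc)
    steady = [(t0 + timedelta(minutes=m), 150) for m in range(7)]
    print(sustained_breach(steady))
```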

P2 response

1. Acknowledge within 1 hour. Use the same channel and customer-notification cadence as P1, but with relaxed timing.
2. If the issue is single-customer: schedule it as a calendar item for a next-business-day fix and communicate that to the customer.

After every incident

1. Update this runbook with whatever you wished you had known when you started.
2. Open a follow-up issue for any preventive change identified.
3. Close the incident channel; archive the post-mortem in the wiki.
