Incident response
Use this when: anything customer-affecting. Outage, data
integrity breach, suspected security event, accidental data
exposure, integration failure cascading across multiple customers.
| Last drill | Elapsed | Next drill |
|---|---|---|
| Not yet executed | n/a | First Friday after MVP private beta |
Severity matrix
| Severity | Examples | First-response SLA |
|---|---|---|
| P1 | Service down for >5% of customers; data integrity event; suspected breach; cryptographic chain broken | 15 minutes |
| P2 | Service down for one customer; extraction-layer degradation (docling/engine); cascading integration failures (multi-customer) | 1 hour |
| P3 | Single-customer integration failure; non-blocking degradation | 4 business hours |
| P4 | Cosmetic / informational | Next business day |
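
The first-response column can also be read as data. The following is a minimal sketch, not part of the existing tooling: the `Severity` type, the `FIRST_RESPONSE_SLA` table, and `ackDeadline` are hypothetical names. P3 and P4 depend on a business calendar that the matrix does not define, so the sketch returns `null` for them rather than guessing.

```typescript
// Sketch only: all names here are hypothetical, not existing code.
type Severity = "P1" | "P2" | "P3" | "P4";

// Values copied from the severity matrix. P3 and P4 are business-time
// SLAs, so they are kept as labels rather than minute counts.
const FIRST_RESPONSE_SLA: Record<Severity, { minutes: number | null; label: string }> = {
  P1: { minutes: 15, label: "15 minutes" },
  P2: { minutes: 60, label: "1 hour" },
  P3: { minutes: null, label: "4 business hours" },
  P4: { minutes: null, label: "Next business day" },
};

// Returns the wall-clock acknowledgement deadline for P1/P2, or null when
// the SLA needs a business calendar that this sketch does not model.
function ackDeadline(severity: Severity, pagedAt: Date): Date | null {
  const { minutes } = FIRST_RESPONSE_SLA[severity];
  return minutes === null ? null : new Date(pagedAt.getTime() + minutes * 60 * 1000);
}

const pagedAt = new Date("2025-01-10T09:00:00Z");
console.log(ackDeadline("P1", pagedAt)?.toISOString()); // 2025-01-10T09:15:00.000Z
console.log(ackDeadline("P3", pagedAt)); // null
```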
On-call
Sprint 0: founder is sole on-call. Once paid GA lands, contractor backup is engaged for nights and weekends.
Pager: BetterStack. Backup channel: SMS to the founder's mobile. Both are set up in BetterStack on the first day of paid beta.
P1 response checklist (run in order)
- Acknowledge the page within 15 minutes. Reply in the
#incident channel: "Acked, investigating, ETA next update 15 min."
- Open an incident channel:
#incident-YYYY-MM-DD-<short-name>.
Pin the runbook URL.
- Stop the bleeding. If the cause is obvious and reversible
(e.g., bad deploy), roll back. Otherwise continue to step 4 (Triage).
- Triage:
  - Cloudflare dashboard → check Worker error rate and queue depth.
  - Fly dashboard → check `muntin-docling-ephemeral-iad` and `muntin-extract-iad` machine status, recent deploys.
  - Fly.io status page (if extraction is failing; both `muntin-docling-ephemeral-iad` and `muntin-extract-iad` ride Fly).
  - Postgres (Neon) → check connection count, recent slow queries.
  - R2 dashboard → check storage class + recent operations.
  - Customer-facing status board: https://status.muntin.digital (rendered from `infra/status-page/components.yaml`; deploy notes in `infra/status-page/deploy.md`).
Worker 503 → status URL (H-sre-3 follow-up). When the Worker eventually grows a global 503 handler (out of scope for the H-sre-3 status-page batch), it MUST link customers to https://status.muntin.digital, either via a `Link: <https://status.muntin.digital>; rel="status"` header on every 5xx response or inline in the JSON body (`status_url: "https://status.muntin.digital"`). Until that handler lands, the email-template footer + the in-product error boundary carry the link; the runbook's "customer notification" step below covers the manual path.
- Customer notification: if the event affects a customer, post
to the status page and email affected customers within the following SLAs:
  - Status-page incident-create: when the BetterStack monitor-fire payload lands (page, SMS, or #alerts channel post), the on-call MUST open a corresponding status-page incident immediately; do not wait for triage to finish. Copy the monitor's affected-component(s) field into the status-page component selector, set state to `investigating`, and use the alert payload's summary line as the first update. The status-page incident is the customer-facing ledger of every monitor fire; if it lags the page, customers find out from social media first.
  - GDPR Personal Data Breach: 72 hours from becoming aware (DPA Section 10). "Becoming aware" means internal confirmation of breach, not a customer report, not a guess. Document the timestamp at which awareness is established.
  - Service unavailability: every 30 minutes during the incident, plus a final post-mortem within 5 business days.
- Recovery: confirm health checks green, queue drained,
error rate baseline. Wait 15 minutes before declaring resolved.
- Post-mortem: within 5 business days. Blameless. Includes
timeline, root cause, contributing factors, action items, ownership.
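
The "Worker 503 → status URL" note in the Triage step describes a handler that does not exist yet. As a rough sketch of what it asks for, and not the eventual implementation, the following uses only the standard `Response` and `Headers` APIs. The function names `withStatusLink` and `unavailable` and the `error` body field are hypothetical. The note allows either the header or the JSON field; this sketch sets both.

```typescript
// Sketch only: function names and the "error" field are hypothetical.
const STATUS_URL = "https://status.muntin.digital";

// Adds the status-page Link header to any 5xx response.
// Non-5xx responses pass through untouched.
function withStatusLink(res: Response): Response {
  if (res.status < 500) return res;
  const headers = new Headers(res.headers);
  headers.set("Link", `<${STATUS_URL}>; rel="status"`);
  return new Response(res.body, {
    status: res.status,
    statusText: res.statusText,
    headers,
  });
}

// Builds a 503 whose JSON body also carries status_url.
function unavailable(message: string): Response {
  const body = JSON.stringify({ error: message, status_url: STATUS_URL });
  const res = new Response(body, {
    status: 503,
    headers: { "Content-Type": "application/json" },
  });
  return withStatusLink(res);
}

const res = unavailable("service temporarily unavailable");
console.log(res.status, res.headers.get("Link"));
```

Wrapping the Worker's outermost response in `withStatusLink` would cover every 5xx path, including errors that do not go through `unavailable`.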
P1 examples and shortcuts
Audit chain broken (verify endpoint returns valid: false)
A customer's `GET /v1/audit/verify` returns `valid: false`. This means the hash chain has been tampered with or a row was modified outside the append path. This is always a P1.
- Pull the affected events:
`wrangler d1 execute muntin-api --command "SELECT id, ts, action FROM audit_events WHERE org_id = '<id>' ORDER BY ts ASC"`. _(Binding name `muntin-api` matches `apps/api/wrangler.toml`'s `database_name`. If you see a different binding in `wrangler.toml`, use that.)_
- The break point is the `brokenAt` event id; investigate every
row from there onward.
- Restore the affected rows from the WORM mirror in S3 Object Lock
(Phase-4 capability; until then escalate to founder direct).
- Notify the customer within 24 hours regardless of whether the
tamper was malicious. The audit chain is the privacy commitment; if it breaks, the customer hears about it.
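
To make the break-point idea concrete, here is a minimal sketch of hash-chain verification. The actual hashing scheme is not documented in this runbook, so everything about the row layout beyond `id`, `ts`, and `action` is an assumption: the `prevHash` and `hash` fields, the all-zero genesis value, and the SHA-256 input format are illustrative only. It shows why an edit to one row surfaces at that row's id, and why every row from there onward is suspect.

```typescript
import { createHash } from "node:crypto";

// Hypothetical row shape: id, ts, action come from the runbook's query;
// prevHash and hash are assumed columns, not confirmed schema.
interface AuditEvent {
  id: string;
  ts: string;
  action: string;
  prevHash: string;
  hash: string;
}

const GENESIS = "0".repeat(64); // assumed prevHash of the first event

// Assumed scheme: each hash commits to the previous hash plus the row fields.
function eventHash(prevHash: string, e: { id: string; ts: string; action: string }): string {
  return createHash("sha256").update([prevHash, e.id, e.ts, e.action].join("\n")).digest("hex");
}

// Walks the chain oldest-first and reports the first event that fails.
function verifyChain(events: AuditEvent[]): { valid: boolean; brokenAt?: string } {
  let prev = GENESIS;
  for (const e of events) {
    if (e.prevHash !== prev || e.hash !== eventHash(prev, e)) {
      return { valid: false, brokenAt: e.id };
    }
    prev = e.hash;
  }
  return { valid: true };
}

// Helper for the demo: appends an event through the normal append path.
function append(chain: AuditEvent[], id: string, ts: string, action: string): void {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].hash : GENESIS;
  chain.push({ id, ts, action, prevHash, hash: eventHash(prevHash, { id, ts, action }) });
}

const chain: AuditEvent[] = [];
append(chain, "evt_1", "2025-01-10T09:00:00Z", "login");
append(chain, "evt_2", "2025-01-10T09:01:00Z", "document.upload");
append(chain, "evt_3", "2025-01-10T09:02:00Z", "document.view");
console.log(verifyChain(chain)); // { valid: true }

chain[1].action = "document.delete"; // modify a row outside the append path
console.log(verifyChain(chain)); // { valid: false, brokenAt: 'evt_2' }
```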
docling worker pool unhealthy
Symptom: queue depth >100 sustained for >5 minutes. Start with `flyctl logs --app muntin-docling-ephemeral-iad`.
- Scale machine count:
`flyctl scale count 6 --app muntin-docling-ephemeral-iad`.
- Check whether a single document is wedging a worker (90+ second
processing): page-event log will show the wedge.
- If a wedge: kill the affected machine (`flyctl machine destroy <id>`); job goes to DLQ; manual replay later.
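
The "queue depth >100 sustained for >5 minutes" trigger can be sketched as a check over timestamped samples. This is an illustration of the rule only, not the BetterStack monitor's actual configuration: `QueueSample` and `isPoolUnhealthy` are hypothetical names, and how samples are collected is left out.

```typescript
// Sketch only: QueueSample and isPoolUnhealthy are hypothetical names.
interface QueueSample {
  atMs: number; // sample time, epoch milliseconds
  depth: number; // queue depth at that time
}

const DEPTH_THRESHOLD = 100; // from the runbook: depth > 100
const SUSTAIN_MS = 5 * 60 * 1000; // from the runbook: for > 5 minutes

// Samples must be sorted oldest-first. Returns true when the trailing run
// of over-threshold samples spans more than SUSTAIN_MS.
function isPoolUnhealthy(samples: QueueSample[]): boolean {
  if (samples.length === 0) return false;
  const latest = samples[samples.length - 1];
  let runStart: number | null = null;
  for (let i = samples.length - 1; i >= 0; i--) {
    if (samples[i].depth <= DEPTH_THRESHOLD) break; // run ends at first dip
    runStart = samples[i].atMs;
  }
  return runStart !== null && latest.atMs - runStart > SUSTAIN_MS;
}

const minute = 60 * 1000;
const steady = [0, 1, 2, 3, 4, 5, 6].map((m) => ({ atMs: m * minute, depth: 150 }));
console.log(isPoolUnhealthy(steady)); // true

// A single dip below the threshold at minute 3 resets the run.
const dipped = steady.map((s, i) => (i === 3 ? { ...s, depth: 80 } : s));
console.log(isPoolUnhealthy(dipped)); // false
```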
Extract service 502s sustained
Most likely a docling Machine spawn failure or the deterministic engine hitting a fallback ladder. See extraction-degradation.md.
P2 response
- Acknowledge within 1 hour. Same channel + customer-notification
cadence as P1 but with relaxed timing.
- If the issue is single-customer: tag it on a calendar item for
next-business-day fix; communicate that to the customer.
After every incident
- Update this runbook with whatever you wished you had known when
you started.
- Open a follow-up issue for any preventive change identified.
- Close the incident channel; archive the post-mortem in the wiki.