
Three layers: a short note at the top, the key lines with our take in the middle, the full source at the bottom.

API route

demo-extract.ts

The /v1/demo/extract API route, kept for internal tests. No persistence, no telemetry, no model training — read the code to confirm.

Repo path apps/api/src/routes/demo-extract.ts
Language TypeScript

What this is

An internal API endpoint we keep for our own tests. It runs the same reading pipeline as authenticated uploads, but in memory only — no storage write, no database row, no audit-log entry. The /demo page does not call this endpoint anymore; it survives so the engineering team can test the read pipeline against fresh inputs without involving customer data.
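
The "in memory only" claim rests on a small pattern in the source below: the handler wraps the uploaded bytes in a stand-in object that exposes only arrayBuffer(), so the pipeline can read the file without anything being written to storage. Here is a minimal sketch of that pattern. The names ReadableObject and makeInMemoryObject are hypothetical, and the copy strategy is an assumption chosen for type safety, not the file's own code.

```typescript
// Hypothetical minimal interface: the one method the pipeline is said to call.
interface ReadableObject {
  arrayBuffer(): Promise<ArrayBuffer>;
}

// Wrap request-scoped bytes in an object the pipeline can read from.
// Nothing here touches disk, a bucket, or a database.
function makeInMemoryObject(bytes: Uint8Array): ReadableObject {
  // Copy only the viewed range. If `bytes` is a window onto a larger
  // buffer, the neighbouring bytes must not leak into the pipeline.
  const copy = new ArrayBuffer(bytes.byteLength);
  new Uint8Array(copy).set(bytes);
  return {
    arrayBuffer: async () => copy,
  };
}
```

Because the copy is taken once, later changes to the original buffer do not affect what the pipeline reads, and the copy becomes unreachable as soon as the request ends.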

What it proves

Backs the promise that the /demo page never asks for your invoice. Even though this endpoint exists, the demo page has no path to call it, and the endpoint itself writes nothing to disk. Read the promise →

What to look for in the source below

  • The handler reads the file from the request body, runs extraction, and returns the parsed result.
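  • No call to the storage layer, no insert into the database, no audit-log write. CI enforces this with scripts/check-demo-no-persistence.mjs, which fails the build if any of those calls appears in this file.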
  • A clear comment at the top of the file naming this as the internal-tests-only endpoint.
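
The header comment in the source below says the CI script greps the file for r2Put, documentsStore, c.env.DB, audit.append and enqueueIngest. The script itself is not reproduced on this page, so the following is only a sketch of how such a check could work. The function name, the Violation shape and the comment-skipping rule are all assumptions. Skipping comment lines is needed because the file's own header comment names the forbidden tokens, and a plain substring scan would flag it.

```typescript
// Tokens named in the source's header comment. The "log-of-bytes" rule
// mentioned there is not a literal token, so it is left out of this sketch.
const FORBIDDEN_TOKENS = [
  "r2Put",
  "documentsStore",
  "c.env.DB",
  "audit.append",
  "enqueueIngest",
] as const;

interface Violation {
  token: string;
  line: number; // 1-based line number in the scanned source
}

// Assumption: a line is a comment if, once trimmed, it starts with
// "//", "/*" or "*". This is a heuristic, not a real parser.
function isCommentLine(text: string): boolean {
  const trimmed = text.trim();
  return (
    trimmed.startsWith("//") ||
    trimmed.startsWith("/*") ||
    trimmed.startsWith("*")
  );
}

// Return every forbidden token found on a non-comment line.
function findPersistenceCalls(source: string): Violation[] {
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    if (isCommentLine(text)) return;
    for (const token of FORBIDDEN_TOKENS) {
      if (text.includes(token)) {
        violations.push({ token, line: index + 1 });
      }
    }
  });
  return violations;
}
```

A CI wrapper would read the route file, call findPersistenceCalls on it, and exit non-zero if the result is non-empty. A substring scan of this kind can be evaded by aliasing an import, so it works as a regression tripwire rather than a proof.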
478 lines

import { Hono } from "hono";
import type { Env } from "../env";
import { runExtraction, type IngestQueueMessage } from "../lib/queue";
import { detectFormat } from "../lib/ingest-format";
import { log } from "../lib/logger";
import { rateLimitAsn } from "../middleware/rate-limit";

/**
 * POST /v1/demo/extract — anonymous, no-persistence extraction.
 *
 * The route the chef-owner hits when she clicks "Try it with a
 * Sysco PDF →" on the landing page. NO auth, NO durable storage,
 * NO log of file bytes — the privacy promise on
 * /promises#anonymous-demo is a CI contract enforced by
 * scripts/check-demo-no-persistence.mjs.
 *
 * The route runs the SAME extraction engine as authenticated
 * uploads (runExtraction in lib/queue.ts) — Lens 06 §10 #6: "the
 * demo IS the product, running in a hostile mode. Same engine,
 * same determinism, no shortcuts." A stub R2ObjectBody satisfies
 * runExtraction's arrayBuffer() contract from a Uint8Array held
 * only in this request's memory.
 *
 * Threat model + mitigations (Lens 06 §9 + p1-plan §8):
 *   1. Budget-burn → daily breaker ($50/day default) returns 503
 *      with fallback="static" past cap.
 *   2. Per-IP flood → daily cap (5/day default) returns 429.
 *   3. Bot abuse → Turnstile challenge from run-2-of-day. If
 *      TURNSTILE_SECRET_KEY unprovisioned, degrades to 1/day
 *      hard cap (documented in p1-done.md).
 *   4. Decompression bomb / oversized PDF → 5MB content-length
 *      cap, pre-body via header + post-read.
 *   5. Replay → SHA-256 + per-IP key, 1h TTL. Same bytes from
 *      same IP within 1h return cached result without burning
 *      another extraction.
 *   6. Bytes-in-log → never log the raw file body, never stringify
 *      it. The funnel.demo_upload_parsed event carries format +
 *      vendor-recognised + parse-success only.
 *   7. Persistence regression → check-demo-no-persistence.mjs
 *      greps this file for r2Put / documentsStore / c.env.DB /
 *      audit.append / enqueueIngest / log-of-bytes. Build fails
 *      if any appears.
 *   8. NCMEC hash check → STUBBED in P1 (returns false). The
 *      binding requires founder NCMEC enrolment; documented in
 *      p1-done.md as P1 debt.
 *
 * Response envelope (Lens 06 §5):
 *   { ok: true, cached: boolean, invoice: Invoice, needs_review,
 *     needs_review_reasons }
 *   { ok: false, reason: <enum>, ...context }
 *
 * Cohort source: synthesis §4 P1.1; Lens 06 §10 + §11; p1-plan A2.
 */

export const demoExtract = new Hono<{ Bindings: Env }>();

const DEFAULT_MAX_BODY_BYTES = 5 * 1024 * 1024; // 5MB
const DEFAULT_PER_IP_DAILY_CAP = 5;
const DEFAULT_DAILY_BUDGET_CENTS = 5000; // $50
const APPROX_COST_PER_RUN_CENTS = 2; // ~$0.02 per extraction
const REPLAY_TTL_SEC = 60 * 60; // 1h
const DAILY_TTL_SEC = 60 * 60 * 24;
// Closes P1 carry-over #9. Per-ASN minute cap layered ON TOP of
// the per-IP daily cap so an attacker rotating through N IPs in
// the same hosting-provider ASN (DigitalOcean, AWS, GCP, ...) does
// not get N × DEFAULT_PER_IP_DAILY_CAP runs. 200/min sits far
// above honest residential traffic — Comcast / AT&T / Verizon
// host millions of users on a single ASN and any honest minute
// looks like single-digit demo runs — and clamps coordinated
// scraping floods within 60s.
const DEFAULT_PER_ASN_PER_MINUTE = 200;

interface CachedExtractionResult {
  invoice: unknown;
  needs_review?: boolean;
  needs_review_reasons?: string[];
}

demoExtract.post("/extract", async (c) => {
  const env = c.env;

  // (1) Kill switch — emergency rollback flag. Per Risk 8.1: flip
  // DEMO_ANONYMOUS_EXTRACT=off in wrangler to drain the route
  // without a code revert.
  if (env.DEMO_ANONYMOUS_EXTRACT === "off") {
    return c.json(
      { ok: false, fallback: "static", reason: "feature_off" },
      503,
    );
  }

  // (2) Pre-body Content-Length check. Cloudflare Workers reads the
  // body lazily; if a 50MB file arrives we want to reject before
  // forming the FormData object (which would allocate the buffer).
  const contentLengthHeader = c.req.header("content-length");
  // Fixed ceiling. No env override is consulted for the body size.
  const maxBytes = DEFAULT_MAX_BODY_BYTES;
  if (contentLengthHeader) {
    const declaredSize = parseInt(contentLengthHeader, 10);
    if (Number.isFinite(declaredSize) && declaredSize > maxBytes) {
      return c.json(
        { ok: false, reason: "too_large", limit_bytes: maxBytes },
        413,
      );
    }
  }

  // (3) Visitor IP — Cloudflare's cf-connecting-ip header, with a
  // fallback to x-forwarded-for for non-CF dev environments. Never
  // logged in funnel events.
  const ip =
    c.req.header("cf-connecting-ip") ??
    c.req.header("x-forwarded-for")?.split(",")[0]?.trim() ??
    "unknown";

  // (4) KV availability check. The gates below all need KV; without
  // it we cannot enforce rate-limits, so we fail closed.
  const kv = env.AUTH_KV;
  if (!kv) {
    return c.json({ ok: false, reason: "infrastructure_unavailable" }, 503);
  }

  // (5) Pre-read per-IP daily cap. Checked before the body is read
  // so we never buffer up to 5MB just to reject. This check runs
  // unconditionally, so an IP at its cap gets 429 even for a
  // re-upload that would hit the replay cache; only the Turnstile
  // gate (10b) is deferred until after the replay-cache check.
  const today = new Date().toISOString().slice(0, 10);
  const ipKey = `demo:runs:${ip}:${today}`;
  const perIpCap = parseIntOr(env.DEMO_DAILY_PER_IP, DEFAULT_PER_IP_DAILY_CAP);
  const ipRunsRaw = await kv.get(ipKey);
  const ipRuns = ipRunsRaw ? parseInt(ipRunsRaw, 10) || 0 : 0;
  if (ipRuns >= perIpCap) {
    return c.json(
      { ok: false, reason: "rate_limited", retry_after_sec: DAILY_TTL_SEC },
      429,
    );
  }

  // (5b) Per-ASN minute cap. Reads request.cf?.asn (set by
  // Cloudflare on every incoming request). 0 / undefined (dev
  // env, non-CF runtime) skips the probe — the per-IP cap stays
  // load-bearing. The native INGEST_RATELIMIT binding handles
  // the bucketing; a missing binding fails open per the
  // rate-limit middleware's private-beta posture.
  const cfMeta = (c.req.raw as unknown as { cf?: { asn?: unknown } }).cf;
  const asn = cfMeta && typeof cfMeta.asn === "number" ? cfMeta.asn : 0;
  const perAsnLimit = parseIntOr(
    env.DEMO_PER_ASN_PER_MINUTE,
    DEFAULT_PER_ASN_PER_MINUTE,
  );
  const asnDecision = await rateLimitAsn(env, asn, perAsnLimit);
  if (!asnDecision.allowed) {
    log(env, {
      event: "demo.outcome",
      level: "warn",
      fields: {
        reason: "rate_limited_asn",
        http: 429,
        asn,
      },
    });
    c.header("Retry-After", String(asnDecision.retry_after_seconds));
    return c.json(
      {
        ok: false,
        reason: "rate_limited",
        scope: "asn",
        retry_after_sec: asnDecision.retry_after_seconds,
      },
      429,
    );
  }

  // (6) Daily org-wide budget breaker. Past cap, drop to static
  // fallback (the existing /demo page's four-row sample table).
  const budgetKey = `demo:budget:${today}`;
  const dailyBudgetCents = parseIntOr(
    env.DEMO_DAILY_BUDGET_CENTS,
    DEFAULT_DAILY_BUDGET_CENTS,
  );
  const budgetSpentRaw = await kv.get(budgetKey);
  const budgetSpent = budgetSpentRaw ? parseInt(budgetSpentRaw, 10) || 0 : 0;
  if (budgetSpent + APPROX_COST_PER_RUN_CENTS > dailyBudgetCents) {
    return c.json({ ok: false, fallback: "static", reason: "budget_cap" }, 503);
  }

  // (7) Read body. Multipart form-data with a "file" field is the
  // canonical shape; we also accept a raw body (octet-stream) for
  // dev convenience.
  let bytes: Uint8Array;
  try {
    const ct = c.req.header("content-type") ?? "";
    if (ct.includes("multipart/form-data")) {
      const formData = await c.req.formData();
      const file = formData.get("file");
      // File | string | null — discriminate by the methods we need.
      // We don't use `instanceof File` because the Workers types
      // don't always expose `File` as a global (the constructor
      // exists at runtime but the type-side ambient depends on the
      // worker.d.ts version).
      if (
        !file ||
        typeof file === "string" ||
        typeof (file as { arrayBuffer?: unknown }).arrayBuffer !== "function"
      ) {
        return c.json({ ok: false, reason: "missing_file" }, 400);
      }
      const fileSize = (file as { size?: number }).size ?? 0;
      if (fileSize > maxBytes) {
        return c.json(
          { ok: false, reason: "too_large", limit_bytes: maxBytes },
          413,
        );
      }
      bytes = new Uint8Array(
        await (
          file as { arrayBuffer: () => Promise<ArrayBuffer> }
        ).arrayBuffer(),
      );
    } else {
      const ab = await c.req.arrayBuffer();
      if (ab.byteLength > maxBytes) {
        return c.json(
          { ok: false, reason: "too_large", limit_bytes: maxBytes },
          413,
        );
      }
      bytes = new Uint8Array(ab);
    }
  } catch {
    return c.json({ ok: false, reason: "malformed_request" }, 400);
  }

  // (8) Empty-body check. A zero-length upload is treated the same
  // as a missing file.
  if (bytes.byteLength === 0) {
    return c.json({ ok: false, reason: "missing_file" }, 400);
  }

  // (9) SHA-256 of the bytes, used as the replay-cache key. The
  // digest is one-way and does not expose the file contents, so we
  // can log it if needed for debugging.
  const sha = await sha256Hex(bytes);

  // (10) Replay cache: same IP + same SHA within 1h returns the
  // cached result without burning another extraction OR a
  // Turnstile challenge. Caches are KV-only (never the bytes
  // themselves). This check runs BEFORE Turnstile so a returning
  // chef-owner re-uploading the same PDF is never blocked.
  const replayKey = `demo:replay:${ip}:${sha}`;
  const cached = await kv.get(replayKey);
  if (cached) {
    let parsed: CachedExtractionResult;
    try {
      parsed = JSON.parse(cached) as CachedExtractionResult;
    } catch {
      parsed = { invoice: null };
    }
    return c.json({ ok: true, cached: true, ...parsed });
  }

  // (10b) Turnstile gate — runs AFTER the replay-cache check (so
  // re-uploads bypass) but BEFORE the engine call. From run-2-of-
  // day onward, require Turnstile (or degrade to a 1/day hard cap
  // if no secret is provisioned). The first run per IP per day is
  // unchallenged so a real chef-owner with her first PDF is never
  // blocked by a CAPTCHA.
  if (ipRuns >= 1) {
    const turnstileSecret = env.TURNSTILE_SECRET_KEY?.trim() ?? "";
    if (!turnstileSecret) {
      return c.json(
        {
          ok: false,
          reason: "rate_limited",
          retry_after_sec: DAILY_TTL_SEC,
        },
        429,
      );
    }
    const turnstileToken = c.req.header("cf-turnstile-token") ?? "";
    if (!turnstileToken) {
      return c.json({ ok: false, reason: "turnstile_required" }, 403);
    }
    const ok = await verifyTurnstile(turnstileSecret, turnstileToken, ip);
    if (!ok) {
      return c.json({ ok: false, reason: "turnstile_invalid" }, 403);
    }
  }

  // (11) Format detection (shared with the authenticated path —
  // Lens 06 §10 #6: no fork).
  const format = detectFormat(bytes);
  if (format === "unknown" || format === "eml") {
    return c.json(
      { ok: false, reason: "unsupported_format", got: format },
      415,
    );
  }

  // (12) NCMEC hash check — STUBBED in P1. The binding requires
  // founder NCMEC enrolment; documented in p1-done.md.
  // (Always returns false for now; the stub keeps the call-site
  // present so the binding can be wired without touching the
  // route shape.)
  if (await ncmecMatchStub(sha)) {
    return c.json({ ok: false, reason: "blocked" }, 451);
  }

  // (13) Run extraction. The engine call is the same one
  // authenticated /v1/ingest uses (Lens 06 §10 #6 — no fork).
  // The stub R2ObjectBody satisfies runExtraction's arrayBuffer()
  // contract without ever touching R2.
  const documentId = `demo_${crypto.randomUUID()}`;
  const orgId = "demo:anon";
  const stubObject = makeStubR2Object(bytes);
  const stubMessage: IngestQueueMessage = {
    id: documentId,
    document_id: documentId,
    org_id: orgId,
    owner_user_id: "demo:anon",
    retention_until: new Date(Date.now() + 60 * 60 * 1000).toISOString(),
    file_meta: {
      format: format as IngestQueueMessage["file_meta"]["format"],
      sha256: sha,
      size_bytes: bytes.byteLength,
      source: "drop",
    },
    enqueued_at: new Date().toISOString(),
  };

  const startedAt = Date.now();
  let attempt;
  try {
    attempt = await runExtraction(env, stubMessage, stubObject);
  } catch {
    return c.json({ ok: false, reason: "engine_error" }, 502);
  }
  const parseMs = Date.now() - startedAt;

  // (14) Increment counters AFTER extraction (so a thrown engine
  // error doesn't count against the visitor) but BEFORE returning
  // (so a slow client can't replay-race). A handled extraction
  // failure (attempt.ok === false) still counts as a run.
  await Promise.all([
    kv.put(ipKey, String(ipRuns + 1), { expirationTtl: DAILY_TTL_SEC }),
    kv.put(budgetKey, String(budgetSpent + APPROX_COST_PER_RUN_CENTS), {
      expirationTtl: DAILY_TTL_SEC * 2,
    }),
  ]);

  if (!attempt.ok) {
    // Emit a privacy-clean failure event for the funnel — format
    // + reason only, NO bytes, NO vendor name (we don't have one
    // yet), NO IP.
    log(env, {
      event: "funnel.demo_upload_failed",
      level: "info",
      fields: { format, reason: attempt.reason, parse_ms: parseMs },
    });
    return c.json(
      { ok: false, reason: "extract_failed", inner: attempt.reason },
      502,
    );
  }

  // (15) Cache the successful result for 1h. KV-only; never the
  // bytes. The cached value contains the parsed invoice envelope
  // the client also sees; it is exactly what we returned to the
  // visitor, no more.
  const result: CachedExtractionResult = {
    invoice: attempt.invoice,
    needs_review: attempt.review.needs_review,
    needs_review_reasons: attempt.review.needs_review_reasons,
  };
  await kv.put(replayKey, JSON.stringify(result), {
    expirationTtl: REPLAY_TTL_SEC,
  });

  // (16) Emit the funnel.demo_upload_parsed event. PII-clean:
  // format + parse_ms + whether a vendor name came back (boolean,
  // not the name itself).
  const invoiceObj = attempt.invoice as unknown as {
    vendor?: { name?: { value?: string } | string };
  };
  const vendorNameValue =
    typeof invoiceObj?.vendor?.name === "string"
      ? invoiceObj.vendor.name
      : (invoiceObj?.vendor?.name as { value?: string } | undefined)?.value;
  log(env, {
    event: "funnel.demo_upload_parsed",
    level: "info",
    fields: {
      format,
      parse_ms: parseMs,
      vendor_recognised: !!vendorNameValue,
      ip_run_count: ipRuns + 1,
    },
  });

  return c.json({ ok: true, cached: false, ...result });
});

// ---- helpers ------------------------------------------------------

function parseIntOr(s: string | undefined, fallback: number): number {
  if (!s) return fallback;
  const n = parseInt(s, 10);
  return Number.isFinite(n) ? n : fallback;
}

async function sha256Hex(bytes: Uint8Array): Promise<string> {
  const hash = await crypto.subtle.digest("SHA-256", bytes);
  return Array.from(new Uint8Array(hash))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

async function verifyTurnstile(
  secret: string,
  token: string,
  ip: string,
): Promise<boolean> {
  try {
    const res = await fetch(
      "https://challenges.cloudflare.com/turnstile/v0/siteverify",
      {
        method: "POST",
        headers: { "content-type": "application/x-www-form-urlencoded" },
        body: new URLSearchParams({
          secret,
          response: token,
          remoteip: ip,
        }).toString(),
      },
    );
    if (!res.ok) return false;
    const data = (await res.json()) as { success?: boolean };
    return !!data.success;
  } catch {
    return false;
  }
}

/**
 * NCMEC hash-list match — STUBBED in P1.
 *
 * Lens 06 §9 names this as a mitigation against weaponised
 * uploads (someone using the anonymous demo to launder a known-
 * bad file through our infrastructure). The real binding
 * requires founder NCMEC enrolment + the published hash list as a
 * Workers KV import or fetch.
 *
 * Stubbed to always return false so the call-site exists; the
 * binding can be wired without touching the route shape. Document
 * as P1 debt in p1-done.md.
 */
async function ncmecMatchStub(_sha: string): Promise<boolean> {
  return false;
}

/**
 * Stub R2ObjectBody — satisfies the surface runExtraction reads
 * (arrayBuffer) from a Uint8Array held in this request's memory.
 *
 * The real R2ObjectBody has many more methods + properties;
 * runExtraction only calls .arrayBuffer(). We cast through unknown
 * because constructing a real R2ObjectBody would require an R2
 * write (which is exactly what the demo route MUST NOT do).
 */
function makeStubR2Object(bytes: Uint8Array): R2ObjectBody {
  const slice = bytes.buffer.slice(
    bytes.byteOffset,
    bytes.byteOffset + bytes.byteLength,
  );
  return {
    arrayBuffer: async () => slice,
  } as unknown as R2ObjectBody;
}
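
The order of the per-visitor gates in the handler above is easier to follow as a pure function. The sketch below mirrors three of them: the per-IP daily cap in step (5), the budget breaker in step (6), and the Turnstile gate in step (10b). It uses the default constants from the file. The names GateInput, GateDecision and decideGates are hypothetical. The kill switch, the per-ASN cap and the replay cache are left out on purpose, so this is an illustration of the ordering and not a substitute for the handler.

```typescript
// Defaults copied from the constants in the source above.
const PER_IP_DAILY_CAP = 5;
const DAILY_BUDGET_CENTS = 5000; // $50
const COST_PER_RUN_CENTS = 2; // ~$0.02 per extraction

interface GateInput {
  ipRunsToday: number; // runs already counted for this IP today
  budgetSpentCents: number; // org-wide spend counted so far today
  turnstileProvisioned: boolean; // is a Turnstile secret configured?
  turnstileTokenValid: boolean | null; // null means no token was sent
}

type GateDecision =
  | { allowed: true }
  | { allowed: false; status: number; reason: string };

function decideGates(input: GateInput): GateDecision {
  // Step (5): per-IP daily cap.
  if (input.ipRunsToday >= PER_IP_DAILY_CAP) {
    return { allowed: false, status: 429, reason: "rate_limited" };
  }
  // Step (6): daily budget breaker. A run is refused only if it
  // would push the day's spend past the cap.
  if (input.budgetSpentCents + COST_PER_RUN_CENTS > DAILY_BUDGET_CENTS) {
    return { allowed: false, status: 503, reason: "budget_cap" };
  }
  // Step (10b): the first run of the day is unchallenged.
  if (input.ipRunsToday >= 1) {
    if (!input.turnstileProvisioned) {
      // No secret configured: degrade to a hard cap of one run per day.
      return { allowed: false, status: 429, reason: "rate_limited" };
    }
    if (input.turnstileTokenValid === null) {
      return { allowed: false, status: 403, reason: "turnstile_required" };
    }
    if (!input.turnstileTokenValid) {
      return { allowed: false, status: 403, reason: "turnstile_invalid" };
    }
  }
  return { allowed: true };
}
```

With these defaults the budget breaker admits 5000 / 2 = 2500 extractions per day: a run goes ahead while spent + 2 ≤ 5000, so the run that brings the total to exactly 5000 cents is still allowed and the next one is refused.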

See also

This is the file as it lives at the moment of this build. The canonical history lives in git. If you want the full history or a specific commit, write to hello@muntin.digital.