Data Protection Impact Assessment — Invoice Extraction

Draft, awaiting counsel review. GDPR Article 35 DPIA covering
the invoice-extraction pipeline (RoPA activity A2 in
docs/ropa.md). Written at the engineering level so
counsel can verify each mitigation against a specific code path
or runbook. The published version (post counsel review) lives at
muntin.digital/ledger/dpia/invoice-extraction and supersedes
this draft.

Last updated: 2026-05-11 (draft) · Version: 0.1 (post-pivot, pre-private-beta) · Activity covered: RoPA A2.

1. Necessity and proportionality

The processing covered by this DPIA is the conversion of vendor invoices into structured ledger records. It is the core service the customer is paying for; without it there is no Muntin Ledger.

Why this processing is necessary: a vendor invoice is an unstructured PDF, photo, or email. To be useful for downstream accounting work it must be parsed into a structured record (vendor, totals, line items, dates, terms). No less-invasive processing achieves the customer's goal.

Why this design is proportionate: the v4 architecture deliberately runs as a deterministic pipeline with no LLM in the customer-data path. The engine resolves vendors against a seeded catalog; after roughly three confirmed invoices from the same vendor it has memorised the field positions on that vendor's layout via a per-vendor template. Subsequent invoices file silently after operator confirmation. The earlier v2 architecture routed extraction through a hosted LLM (Anthropic Claude under a zero-retention contract); the v4 pivot removed that hop, which shrinks the trust boundary and removes a class of model-training and prompt-leak risks at the cost of vendor-onboarding latency.
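The template-learning rule described above can be sketched as follows. This is a minimal illustration, not the engine's implementation: the `TemplateLearner` class, its method names, and the fixed threshold of 3 (the text says "roughly three") are assumptions made for the example.

```python
from collections import defaultdict

# Assumption: the text says "roughly three" confirmed invoices; 3 is used here
# only to make the example concrete.
CONFIRMATIONS_BEFORE_TEMPLATE = 3


class TemplateLearner:
    """Hypothetical tracker of operator-confirmed invoices per (org_id, vendor_id)."""

    def __init__(self):
        # Counts are keyed by the tenant AND the vendor, so one organisation's
        # confirmations never contribute to another organisation's template.
        self._confirmed = defaultdict(int)

    def record_confirmation(self, org_id, vendor_id):
        """Call only after an operator has confirmed an extracted record."""
        self._confirmed[(org_id, vendor_id)] += 1

    def has_template(self, org_id, vendor_id):
        """True once enough confirmed invoices exist to rely on a template."""
        return self._confirmed.get((org_id, vendor_id), 0) >= CONFIRMATIONS_BEFORE_TEMPLATE
```

Keying the counter by both identifiers mirrors the (org_id, vendor_id) template key discussed under R1: a template learned for one tenant is never visible through another tenant's key.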

Alternatives considered and rejected: (a) keeping the zero-retention LLM hop, rejected because every additional sub-processor with plaintext access widens the data-flow surface; (b) a fully manual data-entry product, rejected because it does not solve the customer's problem; (c) a third-party OCR API, rejected for the same reason as (a): it adds a sub-processor with plaintext access. The deterministic engine plus open-source docling for layout was assessed as the design with the smallest data-flow surface that still delivers the service.

Operator role: every extracted record is reviewed by an operator at the customer before it commits to the ledger or to the audit chain. This is the human-in-the-loop commitment described in DPA §12.

2. Risks identified

We have identified the following risks to the rights and freedoms of data subjects from this processing activity. Each is sized by likelihood and severity in the v4 architecture, then mitigated in Section 3.

R1 — Cross-tenant exfiltration via templates

A per-vendor template is keyed by (org_id, vendor_id). A template also contains learned field positions (the vendor's invoice layout). If RLS predicates on extraction_templates or template_observations fail, an operator at organisation A could in principle observe organisation B's learned template, leaking the fact that organisation B uses that vendor — and, if the template encodes any of organisation B's invoice content, that content.

Likelihood: low — templates are inserted only on operator confirmation, and RLS is enforced at the table level per the patterns in infra/postgres/migrations/0015_rls_data_plane.sql. Severity: medium — disclosure of vendor relationships and possibly a slice of invoice content.
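One of the R1 controls listed in Section 3 is a `SAFE_ID_RE`-validated `r2KeyFor()` helper that keeps every object under a per-org prefix. The sketch below shows the general shape of such a check in Python. The regular expression, the key layout, and the name `r2_key_for` are assumptions for illustration; the real helper's pattern and path format are not given in this document.

```python
import re

# Assumption: identifiers are 1-64 characters of lowercase letters, digits,
# underscores, or hyphens. The production SAFE_ID_RE may differ.
SAFE_ID_RE = re.compile(r"[a-z0-9_-]{1,64}")


def r2_key_for(org_id: str, document_id: str) -> str:
    """Build an object key under the org's prefix, rejecting unsafe identifiers."""
    for label, value in (("org_id", org_id), ("document_id", document_id)):
        # fullmatch rejects path separators and dots, so an identifier such as
        # "../org_b" cannot escape the per-org prefix.
        if not SAFE_ID_RE.fullmatch(value):
            raise ValueError(f"unsafe {label}: {value!r}")
    # Hypothetical key layout, chosen only for this example.
    return f"orgs/{org_id}/documents/{document_id}"
```

Validating before string interpolation means a malformed identifier fails loudly instead of producing a key that points into another tenant's prefix.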

R2 — Audit-chain attestation forgery

The docling Machine runs the open-source layout engine in an ephemeral Fly Machine. On startup it posts an attestation identifying the image hash it is running. If an attacker substitutes a different image and forges the attestation, they could exfiltrate plaintext invoice content from worker memory during the request lifecycle.

Likelihood: low. Each attestation is verified against the per-deploy expected hash (docs/release-attestations.md) and a mismatch is rejected with a 4xx; ed25519 signing of attestations is planned (see Section 3). Severity: high. Silent plaintext exfiltration would compromise the data-flow promise that no third party processes invoice content.
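The hash comparison behind the 4xx-on-mismatch behaviour can be sketched as below. This is an assumption-laden illustration: the function name, the normalisation of the hash strings, and the choice of 403 as the specific 4xx status are hypothetical, and the sketch does not cover the planned ed25519 signature check.

```python
import hmac


def check_attestation(reported_image_hash: str, expected_image_hash: str) -> int:
    """Return an HTTP-style status: 200 if the hashes match, 403 otherwise.

    Assumption: both values are digest strings such as "sha256:<hex>".
    """
    # Normalise case and surrounding whitespace so formatting differences do
    # not cause false mismatches.
    reported = reported_image_hash.strip().lower().encode("utf-8")
    expected = expected_image_hash.strip().lower().encode("utf-8")
    # compare_digest avoids leaking, through timing, how many leading
    # characters of the expected hash an attacker guessed correctly.
    return 200 if hmac.compare_digest(reported, expected) else 403
```

A bare hash comparison only proves the worker reported the expected value, not that it is actually running that image, which is why Section 3 lists signed attestations as the follow-on control.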

R3 — Retention overrun

The org.retention_seconds setting controls how long raw invoice files persist in R2 after extraction. A bug in the R2 retention reaper (or a misconfigured customer override) could leave raw files in R2 past the intended window.

Likelihood: medium. Retention reapers are state machines that fail in well-known ways (cron drift, partial deletes, R2 list pagination). Severity: medium. The data category is high-sensitivity (invoice content, possibly with SPI per Section 4 of the Privacy Policy's CCPA section), but the overrun is bounded in time and the data is encrypted at rest.
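The expiry decision a reaper has to make can be sketched as a small pure function. The 24-hour default and 90-day ceiling come from Section 3; the function names, the treatment of a missing setting as "use the default", and the capping of oversized overrides are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_RETENTION_SECONDS = 24 * 60 * 60      # 24-hour default (Section 3)
MAX_RETENTION_SECONDS = 90 * 24 * 60 * 60     # 90-day ceiling (Section 3)


def effective_retention(retention_seconds: Optional[int]) -> int:
    """Resolve an org's setting to a bounded number of seconds."""
    if retention_seconds is None:
        return DEFAULT_RETENTION_SECONDS
    if retention_seconds < 0:
        raise ValueError("retention_seconds must not be negative")
    # Assumption: a misconfigured override above the ceiling is capped rather
    # than honoured, so it cannot extend retention past 90 days.
    return min(retention_seconds, MAX_RETENTION_SECONDS)


def is_expired(uploaded_at: datetime, retention_seconds: Optional[int], now: datetime) -> bool:
    """True when a raw file has reached the end of its retention window."""
    deadline = uploaded_at + timedelta(seconds=effective_retention(retention_seconds))
    return now >= deadline
```

Keeping the decision separate from the R2 listing and deletion calls makes it easy to test in isolation, which is the kind of assertion the CI integration test in Section 3 relies on.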

R4 — Sub-processor compromise

A compromise of a sub-processor in the data-flow path (Cloudflare, Fly.io, Neon, AWS KMS, Resend) could affect Customer Data. Each sub-processor's exposure shape is documented at docs/sub-processors.md.

Likelihood: low. These are established providers with their own SOC 2 and ISO 27001 attestations. Severity: varies by provider. KMS compromise is the highest impact (it would let an attacker unwrap per-tenant DEKs); Resend compromise is the lowest (account-email metadata only).

R5 — Incidental SPI in invoice content

A service vendor may include an SSN on a W-9-derived invoice. We do not solicit SPI, but it can land in invoice content. If SPI flows into Sentry error events or into the audit-log target reference, it expands the storage footprint of a data category we should not be storing.

Likelihood: medium. SSN-on-invoice is a known long-tail pattern. Severity: medium. The data is sensitive and would be processed for a purpose other than the one it was provided for.

3. Mitigations

Each risk in Section 2 is mitigated by a specific technical control. Where the control is documented elsewhere, the link is the authoritative reference.

Risk	Mitigation	Reference
R1 (cross-tenant via templates)	Row-level security on `extraction_templates`, `template_observations`, `extractions`, `documents`, `audit_events`. Worker-scoped signed URLs on R2 with per-org prefix and a `SAFE_ID_RE`-validated `r2KeyFor()` helper. Per-tenant DEKs so even if RLS fails the ciphertext is unreadable to the wrong tenant.	`infra/postgres/migrations/0015_rls_data_plane.sql`; `runbooks/tenant-isolation-incident.md`
R2 (attestation forgery)	Planned ed25519-signed attestations (audit follow-on H-priv-3) replacing the current attestation 4xx-on-mismatch design. Until ed25519 signing lands, the attestation 4xx is treated as a chain-integrity event under DPA §10.2 and triggers the 24-hour notification SLA. Docling Machines are ephemeral per-job; no shared state between jobs.	DPA §10.2; `runbooks/tenant-isolation-incident.md`
R3 (retention overrun)	R2 retention reaper runs against `org.retention_seconds`. CI integration test asserts that a record older than its org's retention window is removed on the next reaper pass. R2 lifecycle policy is the second-layer defence per `runbooks/r2-lifecycle.md`. Customer-visible retention default is 24 hours; the customer chooses any extension up to 90 days.	`runbooks/r2-lifecycle.md`
R4 (sub-processor compromise)	30-day advance notice for any new sub-processor (DPA §6), enforced on CI by `scripts/check-subprocessor-freshness.mjs`. Per-sub-processor scope minimisation (e.g. AWS exposure is ciphertext-only, no plaintext invoice content; Resend sees only account emails). Per-tenant DEKs limit blast radius of KMS compromise to whichever tenants are unwrapped during the compromise window.	DPA §6; `docs/sub-processors.md`
R5 (incidental SPI)	The PII scrubber at `tools/pii-scrubber/redaction.py` strips SSNs, US phone numbers, emails, and ACH-shaped account numbers from any Sentry error event and from audit-log target references before persistence. Invoice content itself is treated as confidential under DPA §3 and is encrypted at rest with a per-tenant DEK; it is not used to infer characteristics about a data subject.	DPA §3; Privacy Policy → "For California residents (CCPA/CPRA)" → Sensitive PI
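The R5 row describes a scrubber that redacts identifiers before an error event or audit reference is persisted. A minimal sketch of that pattern follows. It is not the code in tools/pii-scrubber/redaction.py: the regular expressions, the placeholder strings, and the function name `scrub` are assumptions, and ACH-shaped account numbers are left out because their shape is not specified in this document.

```python
import re

# Order matters: SSNs are redacted first so their digits cannot be picked up
# by the looser phone pattern. All patterns are illustrative assumptions.
_PATTERNS = [
    ("[REDACTED_SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[REDACTED_EMAIL]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    (
        "[REDACTED_PHONE]",
        # US-style numbers: optional +1, optional parentheses around the area
        # code, and space, dot, or hyphen separators.
        re.compile(r"(?<!\d)(?:\+1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)"),
    ),
]


def scrub(text: str) -> str:
    """Replace SSN-, email-, and US-phone-shaped substrings with placeholders."""
    for placeholder, pattern in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

For example, `scrub("SSN 123-45-6789, call (415) 555-0100")` returns `"SSN [REDACTED_SSN], call [REDACTED_PHONE]"`. Pattern-based redaction of this kind is best-effort: it reduces the chance of SPI reaching logs but cannot guarantee that unusually formatted identifiers are caught.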

4. Residual risk

After the mitigations in Section 3, the residual risk to data subjects is low, on the following reasoning:

The processing is operator-confirmed at the verdict step. No automated decision-making with legal or similarly significant effects is in scope.

No LLM provider receives customer invoice content; the scripts/no-llm-ci.sh gate makes this an architectural property, not a policy promise.

The audit-chain integrity commitment in DPA §10.2 means a customer learns about any break in the chain within 24 hours even if no Personal Data is implicated.

Retention defaults are short (24 hours for raw files) and customer-configurable; the long-retention surface (7 years on the audit log) is required by overlapping financial-records retention obligations and is the same regulatory floor every accounting product faces.

The 30-day sub-processor notice plus the freshness-check CI gate means a customer who objects to a new sub-processor has time to terminate before activation.
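The 30-day notice rule in the last point lends itself to a mechanical check. The sketch below is a Python illustration of one way such a gate could work; it is not a port of scripts/check-subprocessor-freshness.mjs, whose actual logic is not described in this document. The record fields `name`, `announced_on`, and `active_from` are hypothetical.

```python
from datetime import date, timedelta

NOTICE_PERIOD = timedelta(days=30)  # advance-notice period from DPA §6


def notice_violations(entries):
    """Return the names of sub-processors activated with under 30 days' notice.

    Assumption: each entry is a dict with a "name" string and two
    datetime.date values, "announced_on" and "active_from".
    """
    return [
        entry["name"]
        for entry in entries
        if entry["active_from"] - entry["announced_on"] < NOTICE_PERIOD
    ]
```

A CI job could fail the build whenever this list is non-empty, which turns the contractual notice period into something checked on every change to the sub-processor list.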

The dominant residual exposure is whichever provider sits closest to plaintext (Fly.io worker memory during the request lifecycle). This is reduced by ephemeral Machines, tmpfs scratch, and the fact that workers are scrubbed at job end. We accept this as the necessary cost of running the deterministic engine on managed infrastructure.

5. Review cadence

This DPIA is reviewed:

Annually, on the anniversary of the latest counsel-reviewed version.

On any material change to the processing activity, including but not limited to: introduction of an automated path that posts verdicts without operator confirmation (which is not planned; see DPA §12 for the 90-day notice commitment); addition of a new sub-processor in the customer-data path; introduction of any inference step that draws conclusions about a data subject; expansion of the categories of personal data processed.

Each review is logged with date, reviewer, and a one-line summary of what changed (or "no material change") at the bottom of this document.

Review log

2026-05-11: initial draft, written against the v4 architecture (deterministic engine; per-vendor templates; no LLM in customer-data path). Pending counsel review.

Cross-references

Record of Processing Activities (activity A2)
Data Processing Agreement (especially §3, §6, §10.2, §12)

Tenant-isolation incident