Data Protection Impact Assessment — Invoice Extraction
Draft, awaiting counsel review. GDPR Article 35 DPIA covering the invoice-extraction pipeline (RoPA activity A2 in docs/ropa.md). Written at the engineering level so counsel can verify each mitigation against a specific code path or runbook. The published version (post counsel review) lives at muntin.digital/ledger/dpia/invoice-extraction and supersedes this draft.
Last updated: 2026-05-11 (draft) · Version: 0.1 (post-pivot, pre-private-beta) · Activity covered: RoPA A2.
1. Necessity and proportionality
The processing covered by this DPIA is the conversion of vendor invoices into structured ledger records. It is the core service the customer is paying for; without it there is no Muntin Ledger.
- Why this processing is necessary: a vendor invoice is an
unstructured PDF, photo, or email. To be useful for downstream accounting work it must be parsed into a structured record (vendor, totals, line items, dates, terms). No less-invasive processing achieves the customer's goal.
- Why this design is proportionate: the v4 architecture
deliberately runs as a deterministic pipeline with no LLM in the customer-data path. The engine resolves vendors against a seeded catalog; after roughly three confirmed invoices from the same vendor it has memorised the field positions on that vendor's layout via a per-vendor template. Subsequent invoices file silently after operator confirmation. The earlier v2 architecture routed extraction through a hosted LLM (Anthropic Claude under a zero-retention contract); the v4 pivot removed that hop, which shrinks the trust boundary and removes a class of model-training and prompt-leak risks at the cost of vendor-onboarding latency.
- Alternatives considered and rejected: (a) keeping the zero-retention LLM hop, rejected because every additional sub-processor with plaintext access widens the data-flow surface; (b) a fully manual data-entry product, rejected because it does not solve the customer's problem; (c) a third-party OCR API, rejected for the same reason as (a): it adds a sub-processor with plaintext access. The deterministic engine plus open-source docling for layout was assessed as the least-data-flow design that delivers the service.
- Operator role: every extracted record is reviewed by an
operator at the customer before it commits to the ledger or to the audit chain. This is the human-in-the-loop commitment described in DPA §12.
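The template-learning behaviour described in the proportionality bullet above can be pictured with a minimal Python sketch. It assumes a fixed threshold of three confirmations (the text says only "roughly three"), a hypothetical `TemplateLearner` class, and bounding boxes as the representation of field positions. None of these names are taken from the Muntin codebase.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumption: "roughly three" confirmed invoices is modelled as a fixed threshold.
CONFIRMATIONS_NEEDED = 3

# Hypothetical field position: (x, y, width, height) in page units.
Box = Tuple[int, int, int, int]
# (org_id, vendor_id): learning state is kept separately per organisation.
TemplateKey = Tuple[str, str]


@dataclass
class TemplateLearner:
    observations: Dict[TemplateKey, List[Dict[str, Box]]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def confirm(self, org_id: str, vendor_id: str,
                fields: Dict[str, Box]) -> Optional[Dict[str, Box]]:
        """Record one operator-confirmed invoice.

        Returns a template once enough confirmations exist for this
        (org_id, vendor_id) pair, otherwise None.
        """
        seen = self.observations[(org_id, vendor_id)]
        seen.append(fields)
        if len(seen) < CONFIRMATIONS_NEEDED:
            return None
        names = {name for obs in seen for name in obs}
        # For each field, keep the position that was confirmed most often.
        return {
            name: Counter(obs[name] for obs in seen if name in obs).most_common(1)[0][0]
            for name in names
        }
```

Because observations are keyed per organisation in this sketch, confirmations at one customer never contribute to a template at another, even for the same vendor.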
2. Risks identified
We have identified the following risks to the rights and freedoms of data subjects from this processing activity. Each is sized by likelihood and severity in the v4 architecture, then mitigated in Section 3.
R1 — Cross-tenant exfiltration via templates
A per-vendor template is keyed by (org_id, vendor_id). A template also contains learned field positions (the vendor's invoice layout). If RLS predicates on extraction_templates or template_observations fail, an operator at organisation A could in principle observe organisation B's learned template, leaking the fact that organisation B uses that vendor — and, if the template encodes any of organisation B's invoice content, that content.
Likelihood: low — templates are inserted only on operator confirmation, and RLS is enforced at the table level per the patterns in infra/postgres/migrations/0015_rls_data_plane.sql. Severity: medium — disclosure of vendor relationships and possibly a slice of invoice content.
R2 — Audit-chain attestation forgery
The docling Machine runs the open-source layout engine in an ephemeral Fly Machine. On startup it posts an attestation identifying the image hash it is running. If an attacker substitutes a different image and forges the attestation, they could exfiltrate plaintext invoice content from worker memory during the request lifecycle.
Likelihood: low, because attestations are verified against per-deploy expected hashes (docs/release-attestations.md), with ed25519 signing planned per the R2 mitigation in Section 3. Severity: high, because silent plaintext exfiltration would compromise the data-flow promise that no third party processes invoice content.
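A hash comparison of the kind this likelihood estimate relies on might look like the sketch below. The allow-list contents, the `check_attestation` name, and the choice of 403 as the "4xx" status are all assumptions made for illustration; the real mechanism is documented in docs/release-attestations.md. A plain hash comparison does not authenticate the sender, which is consistent with Section 3 listing ed25519 signing as a planned follow-on.

```python
import hmac

# Hypothetical per-deploy allow-list; in practice it would be produced by the
# release process rather than hard-coded.
EXPECTED_IMAGE_HASHES = {
    "deploy-example": "sha256:" + "ab" * 32,
}


def check_attestation(deploy_id: str, reported_hash: str) -> int:
    """Return an HTTP-style status: 200 on a match, 403 on any mismatch."""
    expected = EXPECTED_IMAGE_HASHES.get(deploy_id)
    if expected is None:
        return 403
    # compare_digest runs in constant time, so the response timing does not
    # reveal how many leading characters of the hash were correct.
    if hmac.compare_digest(expected.encode(), reported_hash.encode()):
        return 200
    return 403
```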
R3 — Retention overrun
The org.retention_seconds setting controls how long raw invoice files persist in R2 after extraction. A bug in the R2 retention reaper (or a misconfigured customer override) could leave raw files in R2 past the intended window.
Likelihood: medium, because retention reapers are state machines that fail in well-known ways (cron drift, partial deletes, R2 list pagination). Severity: medium, because the data category is high-sensitivity (invoice content, possibly with SPI per Section 4 of the Privacy Policy's CCPA section) but the time over-run is bounded and the data is encrypted at rest.
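The selection step of a reaper pass can be sketched as follows. This is an illustration, not the production reaper: `RawFile`, `effective_retention`, and `expired_keys` are hypothetical names, and clamping an over-long override to the 90-day maximum is an assumed defence against the misconfiguration case. The 24-hour default and 90-day ceiling are the figures given in Section 3.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Optional

DEFAULT_RETENTION = timedelta(hours=24)  # default stated in Section 3
MAX_RETENTION = timedelta(days=90)       # maximum extension stated in Section 3


@dataclass(frozen=True)
class RawFile:
    key: str
    org_id: str
    extracted_at: datetime  # retention is counted from extraction


def effective_retention(retention_seconds: Optional[int]) -> timedelta:
    """Use the default when unset; clamp overrides to the maximum (assumption)."""
    if retention_seconds is None or retention_seconds <= 0:
        return DEFAULT_RETENTION
    return min(timedelta(seconds=retention_seconds), MAX_RETENTION)


def expired_keys(files: Iterable[RawFile],
                 retention_by_org: Dict[str, Optional[int]],
                 now: datetime) -> List[str]:
    """Return the keys whose retention window has elapsed at `now`."""
    return [
        f.key
        for f in files
        if now - f.extracted_at >= effective_retention(retention_by_org.get(f.org_id))
    ]
```

Keeping the selection logic a pure function of `now` makes the "record older than its window is removed on the next pass" property straightforward to assert in a test without waiting on a clock.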
R4 — Sub-processor compromise
A compromise of a sub-processor in the data-flow path (Cloudflare, Fly.io, Neon, AWS KMS, Resend) could affect Customer Data. Each sub-processor's exposure shape is documented at docs/sub-processors.md.
Likelihood: low — these are reputable providers with their own SOC2 and ISO 27001 attestations. Severity: variable depending on which provider — KMS compromise is the highest impact (it would let an attacker unwrap per-tenant DEKs); Resend compromise is the lowest (account-email metadata only).
R5 — Incidental SPI in invoice content
A service vendor may include an SSN on a W-9-derived invoice. We do not solicit SPI, but it can land in invoice content. If SPI flows into Sentry error events or into the audit-log target reference, it expands the storage footprint of a data category we should not be storing.
Likelihood: medium — SSN-on-invoice is a known long-tail pattern. Severity: medium — the data is sensitive and is being processed for a purpose other than what it was provided for.
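The kind of redaction that addresses this risk can be sketched in a few lines of Python. The patterns and the `scrub` function below are illustrative assumptions, not the contents of tools/pii-scrubber/redaction.py. ACH-shaped account numbers are left out because the document does not say what shape the scrubber treats as ACH.

```python
import re

# Illustrative patterns only; real-world formats vary more than this.
_PATTERNS = [
    ("[SSN]", re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")),
    ("[EMAIL]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("[PHONE]", re.compile(
        r"(?<!\d)(?:\+?1[-. ])?(?:\(\d{3}\)\s?|\d{3}[-. ])\d{3}[-. ]\d{4}(?!\d)"
    )),
]


def scrub(text: str) -> str:
    """Replace SSN-, email-, and US-phone-shaped substrings with placeholders."""
    for placeholder, pattern in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

For example, `scrub("SSN 123-45-6789 on file")` yields `"SSN [SSN] on file"`, while ordinary invoice figures such as `1,204.50` pass through unchanged.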
3. Mitigations
Each risk in Section 2 is mitigated by a specific technical control. Where the control is documented elsewhere, the link is the authoritative reference.
| Risk | Mitigation | Reference |
|---|---|---|
| R1 (cross-tenant via templates) | Row-level security on extraction_templates, template_observations, extractions, documents, audit_events. Worker-scoped signed URLs on R2 with per-org prefix and a SAFE_ID_RE-validated r2KeyFor() helper. Per-tenant DEKs so even if RLS fails the ciphertext is unreadable to the wrong tenant. | infra/postgres/migrations/0015_rls_data_plane.sql; runbooks/tenant-isolation-incident.md |
| R2 (attestation forgery) | Planned ed25519-signed attestations (audit follow-on H-priv-3) replacing the current attestation 4xx-on-mismatch design. Until ed25519 signing lands, the attestation 4xx is treated as a chain-integrity event under DPA §10.2 and triggers the 24-hour notification SLA. Docling Machines are ephemeral per-job; no shared state between jobs. | DPA §10.2; runbooks/tenant-isolation-incident.md |
| R3 (retention overrun) | R2 retention reaper runs against org.retention_seconds. CI integration test asserts that a record older than its org's retention window is removed on the next reaper pass. R2 lifecycle policy is the second-layer defence per runbooks/r2-lifecycle.md. Customer-visible retention default is 24 hours; the customer chooses any extension up to 90 days. | runbooks/r2-lifecycle.md |
| R4 (sub-processor compromise) | 30-day advance notice for any new sub-processor (DPA §6), enforced on CI by scripts/check-subprocessor-freshness.mjs. Per-sub-processor scope minimisation (e.g. AWS exposure is ciphertext-only, no plaintext invoice content; Resend sees only account emails). Per-tenant DEKs limit blast radius of KMS compromise to whichever tenants are unwrapped during the compromise window. | DPA §6; docs/sub-processors.md |
| R5 (incidental SPI) | The PII scrubber at tools/pii-scrubber/redaction.py strips SSNs, US phone numbers, emails, and ACH-shaped account numbers from any Sentry error event and from audit-log target references before persistence. Invoice content itself is treated as confidential under DPA §3 and is encrypted at rest with a per-tenant DEK; it is not used to infer characteristics about a data subject. | DPA §3; Privacy Policy → "For California residents (CCPA/CPRA)" → Sensitive PI |
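The per-org prefix control in the R1 row depends on identifiers being validated before they are interpolated into an object key. A minimal Python rendering of that idea follows. The table names `SAFE_ID_RE` and `r2KeyFor()` but gives neither the pattern nor the key layout, so both are hypothetical here.

```python
import re

# Hypothetical pattern; the document names SAFE_ID_RE without defining it.
SAFE_ID_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def r2_key_for(org_id: str, document_id: str) -> str:
    """Build an object key under the organisation's own prefix.

    Identifiers containing anything outside the safe alphabet are rejected,
    so a value such as "../other-org" cannot escape the per-org prefix.
    """
    for label, value in (("org_id", org_id), ("document_id", document_id)):
        # fullmatch (not match) so trailing newlines or suffixes are rejected.
        if not SAFE_ID_RE.fullmatch(value):
            raise ValueError(f"unsafe {label}: {value!r}")
    # Hypothetical key layout.
    return f"orgs/{org_id}/raw/{document_id}"
```

Validating both identifiers, rather than only the organisation ID, matters because a crafted document ID containing path separators could otherwise point a signed URL at another tenant's prefix.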
4. Residual risk
After the mitigations in Section 3, the residual risk to data subjects is low, on the following reasoning:
- The processing is operator-confirmed at the verdict step. No
automated decision-making with legal or similarly significant effects is in scope.
- No LLM provider receives customer invoice content; the
scripts/no-llm-ci.sh gate makes this an architectural property, not a policy promise.
- The audit-chain integrity commitment in DPA §10.2 means a
customer learns about any break in the chain within 24 hours even if no Personal Data is implicated.
- Retention defaults are short (24 hours for raw files) and
customer-configurable; the long-retention surface (7 years on the audit log) is required by financial-records overlap and is the same regulatory floor every accounting product faces.
- The 30-day sub-processor notice plus the freshness-check CI gate
means a customer who objects to a new sub-processor has time to terminate before activation.
The dominant residual exposure is whichever provider sits closest to plaintext (Fly.io worker memory during the request lifecycle). This is reduced by ephemeral Machines, tmpfs scratch, and the fact that workers are scrubbed at job end. We accept this as the necessary cost of running the deterministic engine on managed infrastructure.
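The 30-day sub-processor notice relied on in the reasoning above lends itself to a mechanical check. The sketch below is a Python illustration of that idea, not a port of scripts/check-subprocessor-freshness.mjs; the record shape (`name`, `announced_on`, `active_from`) and the provider names in the usage note are assumptions.

```python
from datetime import date, timedelta
from typing import Dict, List

NOTICE_PERIOD = timedelta(days=30)  # notice period attributed to DPA §6


def notice_violations(entries: List[Dict]) -> List[str]:
    """Return names of sub-processors activated under 30 days after announcement."""
    return [
        e["name"]
        for e in entries
        if e["active_from"] - e["announced_on"] < NOTICE_PERIOD
    ]
```

A CI job built on this shape would fail the build whenever the returned list is non-empty, for instance when a hypothetical `ExampleMail` entry is announced on 1 January and scheduled to go live on 15 January.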
5. Review cadence
This DPIA is reviewed:
- Annually, on the anniversary of the latest counsel-reviewed
version.
- On any material change to the processing activity, including but not limited to: introduction of an automated path that posts verdicts without operator confirmation (which is not planned; see DPA §12 for the 90-day notice commitment); addition of a new sub-processor in the customer-data path; introduction of any inference step that draws conclusions about a data subject; expansion of the categories of personal data processed.
Each review is logged with date, reviewer, and a one-line summary of what changed (or "no material change") at the bottom of this document.
Review log
- 2026-05-11 — initial draft, written against the v4 architecture
(deterministic engine; per-vendor templates; no LLM in customer-data path). Pending counsel review.
Cross-references
- Record of Processing Activities (activity A2)
- Data Processing Agreement (especially §3, §6, §10.2,
§12)