AI and Personal Data: What You Can and Cannot Do Without Fines

2026-05-24 · Sintaris · ai, gdpr, 152-fz, dpia, rag, privacy, on-prem

TL;DR. If your AI project touches personal data of people in the EU — that's GDPR. If it's personal data of people in Russia — that's 152-FZ. If both — both laws simultaneously. In most cases the solution is not "choose the right cloud", but pipeline architecture: where the document lives, where the vectors live, where the prompt goes, where the LLM request goes, who sees the logs. Below is a practical breakdown of which patterns pass an audit and which do not, based on how SINTARIS deploys RAG systems in the EU and CIS.

1. The Conflict: "Let's Connect ChatGPT to Our CRM"

This request appears in almost every first project. Inside the company there is already a CRM with client names, phone numbers, and sometimes medical or financial details. Someone saw a ChatGPT demo, and the task emerges: "make it so the manager can ask — and it answers."

Technically, that's two days of work. Legally, it's a potential fine of up to €20 million or 4% of global annual turnover, whichever is higher, under GDPR (Article 83), and/or up to 18 million rubles for violations of 152-FZ (Article 13.11 of the Russian Code of Administrative Offences, 2024 edition). This is not a theoretical threat: Roskomnadzor and European DPAs (BfDI in Germany, Garante in Italy) already issue such fines for "transferred personal data to a third-party vendor without a legal basis."

The goal of this article is to show that the boundary is not where people usually draw it ("OpenAI = bad, local LLM = good"). The boundary lies in how exactly data moves through the system, which parts make you a controller, which make you a sub-processor, and where you must have a documented legal basis.

2. Who This Concerns

If your scenario involves no client names, emails, account numbers or diagnoses, this article is not for you. If it involves even one of them, it is.

3. The Common Wrong Approach

A pattern we've seen dozens of times:

  1. Export CRM data to CSV.
  2. Feed it into a RAG index through a standard n8n workflow.
  3. Attach a gpt-4o call via OpenAI API on top.
  4. Manager types in Telegram: "What did we discuss with Smith last month?" — the bot answers.
  5. Everything works, the demo succeeds, the client is happy.

Why this is unacceptable from a regulatory standpoint:

In Russia: Article 10 of 152-FZ requires written consent, in the form set out in Article 9, for processing special categories (health, political views, etc.). The client never gave consent for "transferring their interaction history with the manager to OpenAI."

4. The Engineering Approach: Where to Draw the Line

The key idea we repeat in every audit: the personal data boundary in an AI system must be visible on the architecture diagram. If it's not visible — it doesn't exist.

The baseline model we apply:

flowchart LR
  subgraph Client Internal Perimeter
    CRM[(CRM with PII)]
    KB[(RAG Knowledge Base)]
    LLM_LOCAL[Local LLM<br/>Ollama / llama.cpp]
    APP[Application / Bot]
  end
  subgraph External Services
    LLM_CLOUD[Cloud LLM<br/>OpenAI / Anthropic]
  end

  CRM -- sanitisation --> KB
  APP -- user question --> KB
  KB -- top-K chunks<br/>WITHOUT names --> LLM_LOCAL
  KB -. optionally, without PII .-> LLM_CLOUD

Two rules:

  1. PII does not cross the perimeter. Documents in the RAG index are either anonymised texts or texts with tokens (<CLIENT_ID:42> instead of "John Smith"). Name reconstruction happens on the application side after the LLM response, via a look-up in a protected table.
  2. Cloud LLM is optional and auditable. If it's needed at all, only an anonymised fragment is sent to it, and every such call is logged. If a regulator arrives — you have a log.
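
Rule 1 can be sketched in a few lines of Python. This is a minimal illustration, not the SINTARIS implementation: the `Pseudonymiser` class and its `names_by_id` table are hypothetical names, and matching against a dictionary of known names is an assumption made for brevity. A real ingestion pipeline would usually also need entity recognition to catch names that are not in the table.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Pseudonymiser:
    """Swaps known client names for <CLIENT_ID:n> tokens and back."""

    # Hypothetical protected look-up table: client id -> full name.
    names_by_id: dict[int, str] = field(default_factory=dict)

    def mask(self, text: str) -> str:
        """Run at ingestion, before text reaches the index or any LLM."""
        # Longest names first, so a full name wins over a shorter overlap.
        ordered = sorted(
            self.names_by_id.items(), key=lambda kv: len(kv[1]), reverse=True
        )
        for client_id, name in ordered:
            token = f"<CLIENT_ID:{client_id}>"
            text = re.sub(re.escape(name), token, text, flags=re.IGNORECASE)
        return text

    def unmask(self, text: str) -> str:
        """Run on the application side, after the LLM response."""

        def restore(match: re.Match) -> str:
            client_id = int(match.group(1))
            # Unknown ids stay as tokens rather than being guessed.
            return self.names_by_id.get(client_id, match.group(0))

        return re.sub(r"<CLIENT_ID:(\d+)>", restore, text)


p = Pseudonymiser({42: "John Smith"})
print(p.mask("Call with John Smith about the invoice."))
# Call with <CLIENT_ID:42> about the invoice.
print(p.unmask("<CLIENT_ID:42> asked for a refund."))
# John Smith asked for a refund.
```

The point of the design is that only the application holds `names_by_id`; the index and the model only ever see tokens.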

Steps to build this:

5. Diagram + Table: What Goes Where

A summary table of which data may be sent to which models under different legal bases:

| Data category | Local LLM (Ollama, your server) | EU-located cloud LLM (Azure OpenAI EU, Mistral) | US cloud LLM (OpenAI, Anthropic) | RU-located cloud LLM (YandexGPT, GigaChat) |
|---|---|---|---|---|
| Public documents (regulations, catalogues) | Yes | Yes | Yes | Yes |
| Internal SOPs without PII | Yes | Yes | Yes, with DPA | Yes |
| Client email correspondence | Yes | Yes, with DPA + consent | Anonymised only | Anonymised only |
| Full names + client contacts | Yes | Anonymised only | No | Only with Russian data residency |
| Medical data / special categories | Yes, with DPIA | Only with explicit consent + DPIA | No | No, without separate legal basis |
| Financial / banking data | Yes, with DPIA | Anonymised only | No | No |

This is a simplification. The real matrix for a specific project is built as part of a DPIA, but this table is a good starting point.
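
One way to keep such a matrix from living only in a document is to encode it as data that code can query. The sketch below transcribes the table above; the key names (`client_email`, `us_cloud`, and so on), the `Cell` type and the `lookup` function are illustrative assumptions, and "Anonymised only" is modelled as "raw data not allowed". The conditions are carried as text because safeguards such as a DPA or consent cannot be verified by code alone.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    allowed: bool        # False means raw data must not be sent
    condition: str = ""  # safeguard or restriction attached in the table


# Column order of the table above (hypothetical short names).
TARGETS = ("local", "eu_cloud", "us_cloud", "ru_cloud")

MATRIX = {
    "public_documents": (Cell(True), Cell(True), Cell(True), Cell(True)),
    "internal_sops_no_pii": (
        Cell(True), Cell(True), Cell(True, "DPA"), Cell(True),
    ),
    "client_email": (
        Cell(True), Cell(True, "DPA + consent"),
        Cell(False, "anonymised only"), Cell(False, "anonymised only"),
    ),
    "names_and_contacts": (
        Cell(True), Cell(False, "anonymised only"),
        Cell(False), Cell(True, "Russian data residency"),
    ),
    "medical_special": (
        Cell(True, "DPIA"), Cell(True, "explicit consent + DPIA"),
        Cell(False), Cell(False, "needs a separate legal basis"),
    ),
    "financial": (
        Cell(True, "DPIA"), Cell(False, "anonymised only"),
        Cell(False), Cell(False),
    ),
}


def lookup(category: str, target: str) -> Cell:
    """Return the table cell; unknown combinations are denied by default."""
    try:
        return MATRIX[category][TARGETS.index(target)]
    except (KeyError, ValueError):
        return Cell(False, "unknown combination, denied by default")


print(lookup("client_email", "eu_cloud"))
# Cell(allowed=True, condition='DPA + consent')
print(lookup("financial", "us_cloud"))
# Cell(allowed=False, condition='')
```

Denying unknown combinations by default means a new data category cannot reach a cloud model until someone has added it to the matrix on purpose.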

6. Sintaris Mini-Case

The Maria — clinic appointment notifier and Clinic assistant products demonstrate this approach in practice. Task: a patient should be able to confirm appointment times, reschedule, and receive a prescription copy via Telegram. The standard temptation: take gpt-4o-mini, feed it the entire PostgreSQL CRM, done.

Technical implementation:

Result: the LLM costs about €2 per month (classification requests are short), incoming calls to reception dropped by 40–50%, and the legal documentation fits on one page because personal data physically never leaves the perimeter.

Architecture details are in the KB section Clinic assistant § 9 "Security & data boundary".

7. Checklist (15 Points) Before Launching an AI Pilot with Personal Data

  1. Purpose of processing is defined and documented for each data category.
  2. Legal basis is chosen and documented (consent / contract / legitimate interest / other).
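  3. DPIA has been conducted where processing is likely to result in a high risk (GDPR Article 35), for example systematic or large-scale processing of special categories.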
  4. DPA has been signed with every sub-processor (including LLM providers if used).
  5. Standard contractual clauses (SCCs) are signed for transfers outside the EEA.
  6. Article 30 GDPR records of processing activities are updated.
  7. PII does not go into embeddings directly — masking at the ingestion stage.
  8. Documents in the index are classified by sensitivity.
  9. Model dispatcher has a "sensitive → local-only" rule checked by automated test.
  10. LLM request logs do not contain PII (or are stored separately and briefly).
  11. Audit log of user and system actions exists and is retained for the required period.
  12. DSAR (right of access) mechanism is tested on a sample request once per quarter.
  13. Erasure (right to be forgotten) mechanism cascades through index, embeddings, and history.
  14. User consent is recorded with timestamp + policy version + IP.
  15. Privacy Policy is published, current, and translated into all site languages.
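
Points 9 and 10 are the easiest to back with code. The following is a minimal sketch under stated assumptions: the sensitivity labels, `choose_model` and `log_entry` are hypothetical names, and the routing rule is deliberately simple (only documents explicitly labelled `public` may go to a cloud model). Note that a hash of the prompt is a pseudonymous fingerprint, not anonymisation, so short prompts could still be guessed from it.

```python
import hashlib
import unittest

LOCAL, CLOUD = "local", "cloud"


def choose_model(sensitivity: str, prefer_cloud: bool = True) -> str:
    """Route a request; anything not explicitly 'public' stays local."""
    if sensitivity == "public" and prefer_cloud:
        return CLOUD
    return LOCAL


def log_entry(prompt: str, model: str) -> dict:
    """Audit record that keeps a hash and a length instead of the prompt."""
    return {
        "model": model,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_chars": len(prompt),
    }


class DispatcherPolicyTest(unittest.TestCase):
    def test_sensitive_is_local_only(self):
        # Unlabelled or unknown documents are treated as sensitive.
        for label in ("sensitive", "internal", "unlabelled", ""):
            self.assertEqual(choose_model(label, prefer_cloud=True), LOCAL)

    def test_public_may_use_cloud(self):
        self.assertEqual(choose_model("public"), CLOUD)

    def test_log_has_no_prompt_text(self):
        entry = log_entry("John Smith asked about his prescription", LOCAL)
        self.assertNotIn("John Smith", str(entry))
```

Saved as a module, the tests can be run with `python -m unittest`. Keeping them in CI means a later change that loosens the routing rule fails the build instead of surfacing in an audit.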

8. Risks

9. What to Do Next

If you already have an AI pilot with real client data — start with an audit using the checklist above. If you're only planning — build two rules into the architecture from the start: "PII does not cross the perimeter" and "model dispatcher with a policy." These are two architectural decisions that cannot be cheaply added later — but are cheap to include from the beginning.

If you'd rather have someone else do it, SINTARIS has a productised AI Audit that includes a DPIA-light and a privacy architecture plan as standard. We run this package often enough to offer it at a fixed price.


Sintaris audits processes and ships AI pilots for businesses in the EU and CIS. A 25% discount on Audit and Pilot packages is available to Slovenian companies from 1 to 30 June 2026.