AI and Personal Data: What You Can and Cannot Do Without Fines

2026-05-24 · Sintaris · ai, gdpr, 152-fz, dpia, rag, privacy, on-prem

TL;DR. If your AI project touches personal data of people in the EU, GDPR applies. If it touches personal data of people in Russia, 152-FZ applies. If both, both laws apply simultaneously. In most cases the solution is not "choose the right cloud" but pipeline architecture: where the document lives, where the vectors live, where the prompt goes, where the LLM request goes, and who sees the logs. Below is a practical breakdown of which patterns pass an audit and which do not, based on how SINTARIS deploys RAG systems in the EU and CIS.

1. The Conflict: "Let's Connect ChatGPT to Our CRM"

This request appears in almost every first project. Inside the company there is already a CRM with client names, phone numbers, and sometimes medical or financial details. Someone saw a ChatGPT demo, and the task emerges: "make it so the manager can ask — and it answers."

Technically, that is two days of work. Legally, it is exposure to a fine of up to €20 million or 4% of global annual turnover, whichever is higher, under GDPR (Article 83), and/or up to 18 million rubles under the updated 152-FZ (Article 13.11 of the Russian Administrative Code, 2024 edition). This is not a theoretical threat: Roskomnadzor and European DPAs (BfDI in Germany, Garante in Italy) already issue fines for transferring personal data to a third-party vendor without a legal basis.

The goal of this article is to show that the boundary is not where people usually draw it ("OpenAI = bad, local LLM = good"). The boundary lies in how exactly data moves through the system, which parts make you a controller, which make you a processor or sub-processor, and where you must have a documented legal basis.

2. Who This Concerns

Small and medium businesses with client data: clinics, schools, lawyers, agencies, B2B sales, any e-commerce.
Startups that "bolted GPT onto" their product and postponed the legal review.
IT department heads to whom "the business" brings an AI assistant idea and asks "when can we implement it?".
Solo consultants and experts monetising their expertise through an AI bot with subscription.

If your scenario contains not a single client name, email, account number or diagnosis — this article is not for you. If there's even one such item — it is.

3. The Common Wrong Approach

A pattern we've seen dozens of times:

Export CRM data to CSV.
Feed it into a RAG index through a standard n8n workflow.
Attach a gpt-4o call via OpenAI API on top.
Manager types in Telegram: "What did we discuss with Smith last month?" — the bot answers.
Everything works, the demo succeeds, the client is happy.

Why this is unacceptable from a regulatory standpoint:

Transfer of personal data to a third party without a legal basis. OpenAI becomes a processor (or a sub-processor, if you yourself process the data on a client's behalf), but there is no DPA (Data Processing Agreement) and no analysis of where the servers are physically located, how long requests are stored (by default, they are stored), or who has access to them.
The purpose of processing is not defined or documented. GDPR requires that for each data category there be an explicit, documented purpose. "More convenient for the manager" is not a purpose.
The minimisation principle is violated. Far more data went into the index than is needed to answer the specific class of questions.
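No DPIA (Data Protection Impact Assessment) was conducted, even though GDPR Article 35 requires one where processing is likely to result in a high risk, for example systematic and extensive automated evaluation of individuals or large-scale processing of special-category data.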

In Russia: Article 10 of 152-FZ requires written consent (in the form set out in Article 9) for processing special categories (health, political views, etc.). The client never gave consent for "transferring their interaction history with the manager to OpenAI."

4. The Engineering Approach: Where to Draw the Line

The key idea we repeat in every audit: the personal data boundary in an AI system must be visible on the architecture diagram. If it's not visible — it doesn't exist.

The baseline model we apply:

flowchart LR
  subgraph Client Internal Perimeter
    CRM[(CRM with PII)]
    KB[(RAG Knowledge Base)]
    LLM_LOCAL[Local LLM<br/>Ollama / llama.cpp]
    APP[Application / Bot]
  end
  subgraph External Services
    LLM_CLOUD[Cloud LLM<br/>OpenAI / Anthropic]
  end

  CRM -- sanitisation --> KB
  APP -- user question --> KB
  KB -- top-K chunks<br/>WITHOUT names --> LLM_LOCAL
  KB -. optionally, without PII .-> LLM_CLOUD

Two rules:

PII does not cross the perimeter. Documents in the RAG index are either anonymised texts or texts with tokens (<CLIENT_ID:42> instead of "John Smith"). Name reconstruction happens on the application side after the LLM response, via a look-up in a protected table.
Cloud LLM is optional and auditable. If it's needed at all, only an anonymised fragment is sent to it, and every such call is logged. If a regulator arrives — you have a log.
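
The first rule can be illustrated with a minimal sketch of tokenisation and name reconstruction. It assumes the protected look-up table is available as a plain dict of CRM id to full name; the function names mask_names and restore_names, the <EMAIL> placeholder, and the email regex are illustrative assumptions, not part of any SINTARIS codebase.

```python
import re

# Simplified email pattern, an assumption for this sketch only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
TOKEN_RE = re.compile(r"<CLIENT_ID:(\d+)>")


def mask_names(text: str, contacts: dict[int, str]) -> str:
    """Replace each known full name with its <CLIENT_ID:n> token."""
    # Longest names first, so a longer name is not split by a shorter one.
    for client_id, name in sorted(contacts.items(), key=lambda kv: -len(kv[1])):
        text = re.sub(re.escape(name), f"<CLIENT_ID:{client_id}>", text,
                      flags=re.IGNORECASE)
    # Emails are dropped rather than tokenised: the LLM does not need them.
    return EMAIL_RE.sub("<EMAIL>", text)


def restore_names(text: str, contacts: dict[int, str]) -> str:
    """Put real names back into an LLM answer; unknown ids stay as tokens."""
    def lookup(match: re.Match) -> str:
        return contacts.get(int(match.group(1)), match.group(0))
    return TOKEN_RE.sub(lookup, text)


if __name__ == "__main__":
    contacts = {42: "John Smith"}  # hypothetical stand-in for the protected table
    chunk = "Call with John Smith (john.smith@example.com) about the renewal."
    print(mask_names(chunk, contacts))
    print(restore_names("<CLIENT_ID:42> agreed to renew.", contacts))
```

Only mask_names runs at ingestion time; restore_names runs in the application after the LLM has answered, so the model itself only ever sees tokens. Matching on a list of known names will miss misspellings and nicknames, which is one reason the manual review step still matters.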

Steps to build this:

Document classification before ingestion: label fields as PII / sensitive / public. Done once with a regex script + manual review.
Masking of names, emails, phone numbers, dates of birth, policy numbers — during the chunking stage. Real values stay in crm.contacts(id, value).
Local LLM by default. For most SMB scenarios, llama3.1:8b-instruct via Ollama delivers sufficient quality. For more demanding tasks, qwen2.5:14b.
Model dispatcher with rules. Policy: if has_pii then refuse_cloud_provider. This is not "reliability through discipline" — it's a config file checked by automated tests.
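
As a rough sketch of the last step, the dispatcher rule "if has_pii then refuse_cloud_provider" could look like the following. The Policy fields, the choose_model function, and the two regex detectors are assumptions made for illustration; a real deployment would more likely rely on the field labels from the classification step than on regexes alone.

```python
import re
from dataclasses import dataclass

# Deliberately crude detectors, assumed for this sketch only.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email address
    re.compile(r"\+?\d[\d\s().-]{7,}\d"),     # phone-like run of digits
]


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy object; in practice loaded from a config file."""
    local_model: str = "llama3.1:8b-instruct"
    cloud_model: str = "gpt-4o"
    allow_cloud: bool = False


def has_pii(prompt: str) -> bool:
    return any(pattern.search(prompt) for pattern in PII_PATTERNS)


def choose_model(prompt: str, policy: Policy, want_cloud: bool = False) -> str:
    """Return the model name to call; the cloud is never chosen for PII."""
    if not want_cloud or not policy.allow_cloud:
        return policy.local_model
    if has_pii(prompt):
        # The rule from the text: has_pii -> refuse the cloud provider.
        return policy.local_model
    return policy.cloud_model


if __name__ == "__main__":
    policy = Policy(allow_cloud=True)
    print(choose_model("Summarise the call with <CLIENT_ID:42>.", policy, want_cloud=True))
    print(choose_model("Reply to jane@example.com", policy, want_cloud=True))
```

Because the rule lives in one pure function, the automated test is a handful of assertions over known prompts, which is what keeps it from depending on developer discipline. Falling back to the local model instead of raising an error is a design choice of this sketch.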

5. Diagram + Table: What Goes Where

A summary table of which data may be sent to which models under different legal bases:

Data category	Local LLM (Ollama, your server)	EU-located cloud LLM (Azure OpenAI EU, Mistral)	US cloud LLM (OpenAI, Anthropic)	RU-located cloud LLM (YandexGPT, GigaChat)
Public documents (regulations, catalogues)	Yes	Yes	Yes	Yes
Internal SOPs without PII	Yes	Yes	Yes, with DPA	Yes
Client email correspondence	Yes	Yes, with DPA + consent	Anonymised only	Anonymised only
Full names + client contacts	Yes	Anonymised only	No	Only with Russian data residency
Medical data / special categories	Yes, with DPIA	Only with explicit consent + DPIA	No	Not without a separate legal basis
Financial / banking data	Yes, with DPIA	Anonymised only	No	No

This is a simplification. The real matrix for a specific project is built as part of a DPIA, but this table is a good starting point.
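
One way to make such a matrix enforceable is to keep it as data next to the dispatcher. The sketch below encodes the table above under assumed names: the category and destination keys, the safeguard labels, and the missing_safeguards function are all hypothetical, and the values simply mirror the simplified cells.

```python
LOCAL, EU, US, RU = "local", "eu_cloud", "us_cloud", "ru_cloud"
NO = None  # the table cell says "No"

# Each cell lists the safeguards that must be in place; an empty set means "Yes".
MATRIX = {
    "public":          {LOCAL: set(), EU: set(), US: set(), RU: set()},
    "internal_sop":    {LOCAL: set(), EU: set(), US: {"dpa"}, RU: set()},
    "client_email":    {LOCAL: set(), EU: {"dpa", "consent"},
                        US: {"anonymised"}, RU: {"anonymised"}},
    "client_contacts": {LOCAL: set(), EU: {"anonymised"},
                        US: NO, RU: {"ru_residency"}},
    "medical":         {LOCAL: {"dpia"}, EU: {"explicit_consent", "dpia"},
                        US: NO, RU: {"separate_legal_basis"}},
    "financial":       {LOCAL: {"dpia"}, EU: {"anonymised"}, US: NO, RU: NO},
}


def missing_safeguards(category: str, destination: str, in_place: set):
    """Return the safeguards still missing, or None if the route is ruled out."""
    required = MATRIX[category][destination]
    if required is None:
        return None
    return required - set(in_place)


if __name__ == "__main__":
    # Client emails to an EU cloud with only a DPA signed: consent is missing.
    print(missing_safeguards("client_email", EU, {"dpa"}))
    # Financial data to a US cloud is ruled out regardless of safeguards.
    print(missing_safeguards("financial", US, {"dpia", "anonymised"}))
```

An empty set means the call may proceed, a non-empty set names what is still missing, and None means the route is closed. The project-specific matrix produced by a DPIA would replace the values here.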

6. Sintaris Mini-Case

Two Sintaris products, the "Maria" clinic appointment notifier and "Clinic assistant", demonstrate this approach in practice. Task: a patient should be able to confirm appointment times, reschedule, and receive a prescription copy via Telegram. The standard temptation: take gpt-4o-mini, feed it the entire PostgreSQL CRM, done.

Technical implementation:

The LLM is used only for intent classification (confirm / reschedule / cancel / question / emergency).
What goes into the prompt: message type (text / voice → transcript) + patient's first name (not full name) + city. No diagnoses, no prescriptions, no policy numbers.
The booking itself — deterministic code that reads available slots from the calendar and writes directly to CRM.
Prescription transfer — a separate workflow: the patient authenticates via a one-time code, the document is delivered through a secure channel, and the event is written to the audit log.
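
The prompt minimisation described above can be sketched as an allowlist applied before the prompt is assembled. The field names (first_name, city), the prompt wording, and the fallback to "question" for unrecognised output are assumptions for illustration; the call to the model itself is left out.

```python
import json

INTENTS = ("confirm", "reschedule", "cancel", "question", "emergency")
# Only these CRM fields may reach the prompt; everything else stays in the CRM.
ALLOWED_FIELDS = ("first_name", "city")


def build_prompt(patient: dict, message: str, message_type: str = "text") -> str:
    """Assemble the classification prompt from allowlisted fields only."""
    context = {key: patient[key] for key in ALLOWED_FIELDS if key in patient}
    context["message_type"] = message_type
    return (
        "Classify the patient's message into exactly one intent: "
        + ", ".join(INTENTS) + ".\n"
        + "Context: " + json.dumps(context, ensure_ascii=False, sort_keys=True) + "\n"
        + "Message: " + message + "\n"
        + "Answer with the intent only."
    )


def parse_intent(llm_output: str) -> str:
    """Accept only known intents; anything else is treated as a question."""
    answer = llm_output.strip().lower().rstrip(".")
    return answer if answer in INTENTS else "question"


if __name__ == "__main__":
    patient = {"first_name": "Ana", "last_name": "Novak", "city": "Ljubljana",
               "diagnosis": "J45", "policy_no": "SI-123"}  # hypothetical record
    print(build_prompt(patient, "Can I move my appointment to Friday?"))
    print(parse_intent(" Reschedule.\n"))
```

An allowlist fails safe: a new column added to the CRM later does not leak into prompts unless someone adds it to ALLOWED_FIELDS. Note that the patient's own message can still contain personal details, which is one more reason to run this classification on a local model.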

Result: LLM cost is about €2 per month (classification requests are short), incoming calls to reception dropped by 40–50%, and the legal documentation fits on one page because personal data never physically leaves the perimeter.

Architecture details — in the KB section Clinic assistant § 9 "Security & data boundary".

7. Checklist (15 Points) Before Launching an AI Pilot with Personal Data

Purpose of processing is defined and documented for each data category.
Legal basis is chosen and documented (consent / contract / legitimate interest / other).
DPIA has been conducted where processing is likely to be high-risk (systematic, large-scale, or involving special categories).
DPA has been signed with every sub-processor (including LLM providers if used).
SCCs (standard contractual clauses) are signed for transfers outside the EEA.
Article 30 GDPR records of processing activities are updated.
PII does not go into embeddings directly — masking at the ingestion stage.
Documents in the index are classified by sensitivity.
Model dispatcher has a "sensitive → local-only" rule checked by automated test.
LLM request logs do not contain PII (or are stored separately and briefly).
Audit log of user and system actions exists and is retained for the required period.
DSAR (right of access) mechanism is tested on a sample request once per quarter.
Erasure (right to be forgotten) mechanism cascades through index, embeddings, and history.
User consent is recorded with timestamp + policy version + IP.
Privacy Policy is published, current, and translated into all site languages.
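
The erasure item in the checklist is the one most often left half-done, so here is a minimal sketch of what "cascades" means. The Stores class uses in-memory dicts and lists as stand-ins for the real index, vector store, and chat history; all names are hypothetical, and a real system would issue deletes against each backend instead.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Stores:
    """In-memory stand-ins for the real index, vector store and chat history."""
    chunks: dict = field(default_factory=dict)      # chunk_id -> (client_id, text)
    embeddings: dict = field(default_factory=dict)  # chunk_id -> vector
    history: list = field(default_factory=list)     # (client_id, message)
    audit_log: list = field(default_factory=list)


def erase_subject(stores: Stores, client_id: int) -> dict:
    """Remove one data subject from every store and record what was removed."""
    chunk_ids = [cid for cid, (owner, _) in stores.chunks.items() if owner == client_id]
    for cid in chunk_ids:
        del stores.chunks[cid]
        stores.embeddings.pop(cid, None)  # vectors are keyed by the same chunk id
    before = len(stores.history)
    stores.history[:] = [(owner, msg) for owner, msg in stores.history
                         if owner != client_id]
    report = {
        "client_id": client_id,
        "chunks": len(chunk_ids),
        "messages": before - len(stores.history),
        "erased_at": datetime.now(timezone.utc).isoformat(),
    }
    stores.audit_log.append(report)  # counts only, no personal data
    return report


if __name__ == "__main__":
    stores = Stores(
        chunks={"c1": (42, "note about client 42"), "c2": (7, "note about client 7")},
        embeddings={"c1": [0.1, 0.2], "c2": [0.3, 0.4]},
        history=[(42, "hi"), (7, "hello"), (42, "bye")],
    )
    print(erase_subject(stores, 42))
```

The returned report doubles as evidence for the audit log: it shows that the request was executed and how much was removed, without storing the erased content again.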

8. Risks

Regulatory. A fine plus an order to stop processing. Realistic range for an EU SMB: €5–50k for a first incident, several times higher for a repeat one.
Reputational. "Data leak via AI" — a headline you don't want on your company.
Technical. Vendor lock-in: migrating between LLM providers without a dispatcher architecture is painful.
Team. Without a written policy, developers will "cut corners" — send PII to the cloud "for testing" and forget.

9. What to Do Next

If you already have an AI pilot with real client data — start with an audit using the checklist above. If you're only planning — build two rules into the architecture from the start: "PII does not cross the perimeter" and "model dispatcher with a policy." These are two architectural decisions that cannot be cheaply added later — but are cheap to include from the beginning.

If you'd rather have someone else do it, SINTARIS has a productised AI Audit that includes a DPIA-light + privacy architecture plan as standard. We run this package often enough to offer it at a fixed price.

10. References

Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401.
Cormack, G., Clarke, C., Buettcher, S. (2009). Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods. SIGIR '09.
Regulation (EU) 2016/679 (GDPR), Articles 5, 6, 30, 35, 83.
Federal Law No. 152-FZ of 27.07.2006 "On Personal Data" (as amended 2024).
Roskomnadzor — methodological recommendations on personal data processing (public materials 2023–2025).
BfDI (Germany), Garante (Italy) — decision reviews on AI incidents, 2023–2025.
SINTARIS internal KB: Taris § 9 Security & data boundary, Architecture patterns § 9 Local-first default.

Sintaris audits processes and ships AI pilots for businesses in the EU and CIS. A 25% discount on Audit and Pilot packages is available to Slovenian companies from 1 to 30 June 2026.