AI and Personal Data: What You Can and Cannot Do Without Fines
TL;DR. If your AI project touches personal data of people in the EU, GDPR applies. If it touches personal data of people in Russia, 152-FZ applies. If both, both laws apply simultaneously. In most cases the solution is not "choose the right cloud" but pipeline architecture: where the documents live, where the vectors live, where the prompt and the LLM request go, and who sees the logs. Below is a practical breakdown of which patterns pass an audit and which do not, based on how SINTARIS deploys RAG systems in the EU and CIS.
1. The Conflict: "Let's Connect ChatGPT to Our CRM"
This request appears in almost every first project. Inside the company there is already a CRM with client names, phone numbers, and sometimes medical or financial details. Someone saw a ChatGPT demo, and the task emerges: "make it so the manager can ask — and it answers."
Technically, that's two days of work. Legally, it's a potential fine of up to 4% of global annual turnover (or €20 million, whichever is higher) under GDPR (Article 83) and/or up to 18 million rubles for violations of 152-FZ (Article 13.11 of the Russian Code of Administrative Offences, 2024 edition). This is not a theoretical threat: Roskomnadzor and European DPAs (BfDI in Germany, Garante in Italy) already issue fines for "transferring personal data to a third-party vendor without a legal basis."
The goal of this article is to show that the boundary is not where people usually draw it ("OpenAI = bad, local LLM = good"). The boundary lies in how exactly data moves through the system, which parts make you a controller, which make you a sub-processor, and where you must have a documented legal basis.
2. Who This Concerns
- Small and medium businesses with client data: clinics, schools, lawyers, agencies, B2B sales, any e-commerce.
- Startups that "bolted on GPT" to their product and postponed the legal review.
- IT department heads to whom "the business" brings an AI assistant idea and asks "when can we implement it?".
- Solo consultants and experts monetising their expertise through an AI bot with subscription.
If your scenario involves no client names, emails, account numbers or diagnoses, this article is not for you. If it involves even one of these, it is.
3. The Common Wrong Approach
A pattern we've seen dozens of times:
- Export CRM data to CSV.
- Feed it into a RAG index through a standard n8n workflow.
- Attach a `gpt-4o` call via the OpenAI API on top.
- Manager types in Telegram: "What did we discuss with Smith last month?" and the bot answers.
- Everything works, the demo succeeds, the client is happy.
Why this is unacceptable from a regulatory standpoint:
- Transfer of personal data to a third party without a legal basis. OpenAI becomes a sub-processor, but there is no DPA (Data Processing Agreement) and no analysis of where the servers are physically located, how long requests are retained (by default, they are retained), and who has access to them.
- The purpose of processing is not defined or documented. GDPR requires that for each data category there be an explicit, documented purpose. "More convenient for the manager" is not a purpose.
- The minimisation principle is violated. Far more data went into the index than is needed to answer the specific class of questions.
- No DPIA (Data Protection Impact Assessment) was conducted, even though it is mandatory for "systematic automated processing of large volumes of personal data" (GDPR Article 35).
In Russia: Article 10 of 152-FZ requires written consent for processing special categories (health, political views, etc.), and Article 9 sets out what that written consent must contain. The client never gave consent for "transferring their interaction history with the manager to OpenAI."
4. The Engineering Approach: Where to Draw the Line
The key idea we repeat in every audit: the personal data boundary in an AI system must be visible on the architecture diagram. If it's not visible — it doesn't exist.
The baseline model we apply:
```mermaid
flowchart LR
    subgraph Client Internal Perimeter
        CRM[(CRM with PII)]
        KB[(RAG Knowledge Base)]
        LLM_LOCAL[Local LLM<br/>Ollama / llama.cpp]
        APP[Application / Bot]
    end
    subgraph External Services
        LLM_CLOUD[Cloud LLM<br/>OpenAI / Anthropic]
    end
    CRM -- sanitisation --> KB
    APP -- user question --> KB
    KB -- top-K chunks<br/>WITHOUT names --> LLM_LOCAL
    KB -. optionally, without PII .-> LLM_CLOUD
```
Two rules:
- PII does not cross the perimeter. Documents in the RAG index are either anonymised texts or texts with tokens (`<CLIENT_ID:42>` instead of "John Smith"). Name reconstruction happens on the application side after the LLM response, via a look-up in a protected table.
- Cloud LLM is optional and auditable. If it's needed at all, only an anonymised fragment is sent to it, and every such call is logged. If a regulator arrives, you have a log.
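The first rule can be illustrated with a minimal Python sketch. It assumes names are already known from the CRM; `TokenVault` and `mask_names` are hypothetical names introduced here, and the in-memory dictionaries merely stand in for the protected look-up table.

```python
import re


class TokenVault:
    """Hypothetical in-memory stand-in for the protected look-up table."""

    def __init__(self):
        self._id_by_name = {}
        self._name_by_id = {}

    def tokenise(self, name: str) -> str:
        """Return a stable placeholder such as <CLIENT_ID:1> for a name."""
        if name not in self._id_by_name:
            client_id = len(self._id_by_name) + 1
            self._id_by_name[name] = client_id
            self._name_by_id[client_id] = name
        return f"<CLIENT_ID:{self._id_by_name[name]}>"

    def restore(self, text: str) -> str:
        """Swap placeholders in an LLM response back to real names."""
        def _swap(match):
            # Unknown ids are left untouched rather than guessed.
            return self._name_by_id.get(int(match.group(1)), match.group(0))
        return re.sub(r"<CLIENT_ID:(\d+)>", _swap, text)


def mask_names(text: str, known_names: list[str], vault: TokenVault) -> str:
    # Longest names first, so a full name is replaced before any
    # shorter name it contains.
    for name in sorted(known_names, key=len, reverse=True):
        text = text.replace(name, vault.tokenise(name))
    return text
```

Only the masked text would go into the index and the prompt; `restore` runs on the application side after the model has answered.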
Steps to build this:
- Document classification before ingestion: label fields as PII / sensitive / public. Done once with a regex script + manual review.
- Masking of names, emails, phone numbers, dates of birth, policy numbers during the chunking stage. Real values stay in `crm.contacts(id, value)`.
- Local LLM by default. For most SMB scenarios, `llama3.1:8b-instruct` via Ollama delivers sufficient quality. For higher load, `qwen2.5:14b`.
- Model dispatcher with rules. Policy: `if has_pii then refuse_cloud_provider`. This is not "reliability through discipline"; it's a config file checked by automated tests.
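One possible shape for the dispatcher policy is sketched below in Python. The regex patterns, the `Route` type, and the `choose_route` function are illustrative assumptions, not the SINTARIS implementation; a real deployment would need a reviewed, locale-aware detector.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only (assumption): email addresses and phone-like digit runs.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    re.compile(r"\+?\d[\d\s()-]{8,}\d"),
]


@dataclass(frozen=True)
class Route:
    provider: str  # "local" or "cloud"
    reason: str


def has_pii(text: str) -> bool:
    """True if any illustrative PII pattern matches the text."""
    return any(pattern.search(text) for pattern in PII_PATTERNS)


def choose_route(prompt: str, cloud_requested: bool) -> Route:
    """Policy: if has_pii then refuse_cloud_provider."""
    if has_pii(prompt):
        return Route("local", "pii_detected")
    if cloud_requested:
        return Route("cloud", "no_pii_detected")
    return Route("local", "default")
```

Because the policy is a pure function, the automated test can be a handful of assertions that run on every commit, for example that a prompt containing an email address never resolves to `"cloud"`.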
5. Diagram + Table: What Goes Where
A summary table of which data may be sent to which models under different legal bases:
| Data category | Local LLM (Ollama, your server) | EU-located cloud LLM (Azure OpenAI EU, Mistral) | US cloud LLM (OpenAI, Anthropic) | RU-located cloud LLM (YandexGPT, GigaChat) |
|---|---|---|---|---|
| Public documents (regulations, catalogues) | Yes | Yes | Yes | Yes |
| Internal SOPs without PII | Yes | Yes | Yes, with DPA | Yes |
| Client email correspondence | Yes | Yes, with DPA + consent | Anonymised only | Anonymised only |
| Full names + client contacts | Yes | Anonymised only | No | Only with Russian data residency |
| Medical data / special categories | Yes, with DPIA | Only with explicit consent + DPIA | No | No, without separate legal basis |
| Financial / banking data | Yes, with DPIA | Anonymised only | No | No |
This is a simplification. The real matrix for a specific project is built as part of a DPIA, but this table is a good starting point.
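To make such a matrix enforceable rather than purely documentary, it could be encoded as data that the dispatcher consults. The Python sketch below covers only two rows of the table; the key names (`client_contacts`, `eu_cloud`, and so on) are hypothetical, and reading "No, without separate legal basis" as a conditional rule is an interpretation.

```python
from enum import Enum


class Rule(Enum):
    YES = "yes"
    CONDITIONAL = "conditional"          # allowed only with listed safeguards
    ANONYMISED_ONLY = "anonymised_only"
    NO = "no"


# (category, destination) -> (rule, required safeguards).
# Two rows of the table, with hypothetical key names.
MATRIX = {
    ("client_contacts", "local"): (Rule.YES, ()),
    ("client_contacts", "eu_cloud"): (Rule.ANONYMISED_ONLY, ()),
    ("client_contacts", "us_cloud"): (Rule.NO, ()),
    ("client_contacts", "ru_cloud"): (Rule.CONDITIONAL, ("ru_data_residency",)),
    ("medical", "local"): (Rule.CONDITIONAL, ("dpia",)),
    ("medical", "eu_cloud"): (Rule.CONDITIONAL, ("explicit_consent", "dpia")),
    ("medical", "us_cloud"): (Rule.NO, ()),
    ("medical", "ru_cloud"): (Rule.CONDITIONAL, ("separate_legal_basis",)),
}


def may_send(category, destination, safeguards=frozenset(), anonymised=False) -> bool:
    """Check a transfer against the matrix; unknown pairs are denied."""
    rule, required = MATRIX.get((category, destination), (Rule.NO, ()))
    if rule is Rule.YES:
        return True
    if rule is Rule.ANONYMISED_ONLY:
        return anonymised
    if rule is Rule.CONDITIONAL:
        return set(required) <= set(safeguards)
    return False
```

Denying unknown combinations by default means a new data category cannot reach a cloud model until someone has explicitly added a row for it.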
6. Sintaris Mini-Case
Two products, "Maria — clinic appointment notifier" and "Clinic assistant", demonstrate this approach in practice. Task: a patient should be able to confirm appointment times, reschedule, and receive a prescription copy via Telegram. The standard temptation: take gpt-4o-mini, feed it the entire PostgreSQL CRM, done.
Technical implementation:
- The LLM is used only for intent classification (`confirm/reschedule/cancel/question/emergency`).
- What goes into the prompt: message type (text / voice → transcript) + patient's first name (not full name) + city. No diagnoses, no prescriptions, no policy numbers.
- The booking itself — deterministic code that reads available slots from the calendar and writes directly to CRM.
- Prescription transfer — a separate workflow: the patient authenticates via a one-time code, the document is delivered through a secure channel, and the event is written to the audit log.
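The prompt restriction in the list above might look roughly like the following Python sketch. The field names, the prompt wording, and the fallback to `question` for unexpected model output are assumptions made for illustration; the actual LLM call is deliberately left out.

```python
ALLOWED_INTENTS = {"confirm", "reschedule", "cancel", "question", "emergency"}
# Only these record fields may ever reach the prompt (hypothetical field names).
ALLOWED_FIELDS = ("message_type", "first_name", "city")


def build_prompt(record: dict, message: str) -> str:
    """Build the classification prompt from whitelisted fields only."""
    context = {key: record[key] for key in ALLOWED_FIELDS if key in record}
    return (
        "Classify the message into one of: "
        + ", ".join(sorted(ALLOWED_INTENTS))
        + f".\nContext: {context}\nMessage: {message}\nAnswer with one word."
    )


def parse_intent(raw_answer: str) -> str:
    """Map model output to a known intent.

    Anything outside the allowed set falls back to "question" (assumption),
    so the deterministic booking code never acts on free-form model text.
    """
    intent = raw_answer.strip().lower().rstrip(".")
    return intent if intent in ALLOWED_INTENTS else "question"
```

A whitelist is used rather than a blacklist so that a new CRM column, such as a diagnosis code, stays out of the prompt unless someone adds it on purpose.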
Result: LLM cost — about €2 per month (classification is short requests), incoming calls to reception dropped 40–50%, and the legal documentation fits on one page because personal data physically never leaves the perimeter.
Architecture details — in the KB section Clinic assistant § 9 "Security & data boundary".
7. Checklist (15 Points) Before Launching an AI Pilot with Personal Data
- Purpose of processing is defined and documented for each data category.
- Legal basis is chosen and documented (consent / contract / legitimate interest / other).
- DPIA has been conducted where processing is systematic.
- DPA has been signed with every sub-processor (including LLM providers if used).
- SCCs (standard contractual clauses) are signed for transfers outside the EEA.
- Article 30 GDPR records of processing activities are updated.
- PII does not go into embeddings directly — masking at the ingestion stage.
- Documents in the index are classified by sensitivity.
- Model dispatcher has a "sensitive → local-only" rule checked by automated test.
- LLM request logs do not contain PII (or are stored separately and briefly).
- Audit log of user and system actions exists and is retained for the required period.
- DSAR (right of access) mechanism is tested on a sample request once per quarter.
- Erasure (right to be forgotten) mechanism cascades through index, embeddings, and history.
- User consent is recorded with timestamp + policy version + IP.
- Privacy Policy is published, current, and translated into all site languages.
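The cascading erasure item is the one most often left untested, so here is a minimal Python sketch of what "cascades through index, embeddings, and history" could mean in code. The `Stores` container and its in-memory fields are hypothetical stand-ins for a real document index, vector store, and chat history.

```python
from dataclasses import dataclass, field


@dataclass
class Stores:
    """Hypothetical in-memory stand-ins for the real storage layers."""
    index: dict = field(default_factory=dict)       # chunk_id -> {"subject_id", "text"}
    embeddings: dict = field(default_factory=dict)  # chunk_id -> vector
    history: list = field(default_factory=list)     # [{"subject_id", "message"}]


def erase_subject(stores: Stores, subject_id: int) -> dict:
    """Remove everything tied to subject_id and report what was deleted."""
    chunk_ids = [
        chunk_id
        for chunk_id, chunk in stores.index.items()
        if chunk["subject_id"] == subject_id
    ]
    for chunk_id in chunk_ids:
        del stores.index[chunk_id]
        # The embedding shares the chunk id, so it is removed in the same pass.
        stores.embeddings.pop(chunk_id, None)
    before = len(stores.history)
    stores.history[:] = [m for m in stores.history if m["subject_id"] != subject_id]
    return {"chunks": len(chunk_ids), "messages": before - len(stores.history)}
```

The returned counts can be written to the audit log as evidence that the request was executed, and running the function a second time for the same subject should report zero deletions.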
8. Risks
- Regulatory. Fine + order to stop processing. Realistic range for EU SMB: €5–50k on a first incident, multiples higher on repeat.
- Reputational. "Data leak via AI" — a headline you don't want on your company.
- Technical. Vendor lock-in: migrating between LLM providers without a dispatcher architecture is painful.
- Team. Without a written policy, developers will "cut corners" — send PII to the cloud "for testing" and forget.
9. What to Do Next
If you already have an AI pilot with real client data — start with an audit using the checklist above. If you're only planning — build two rules into the architecture from the start: "PII does not cross the perimeter" and "model dispatcher with a policy." These are two architectural decisions that cannot be cheaply added later — but are cheap to include from the beginning.
If you'd rather have someone else do it, SINTARIS has a productised AI Audit that includes a DPIA-light and a privacy architecture plan as standard. We run this package often enough to offer it at a fixed price.
10. References
- Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401.
- Cormack, G., Clarke, C., Buettcher, S. (2009). Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods. SIGIR '09.
- Regulation (EU) 2016/679 (GDPR), Articles 5, 6, 30, 35, 83.
- Federal Law No. 152-FZ of 27.07.2006 "On Personal Data" (as amended 2024).
- Roskomnadzor — methodological recommendations on personal data processing (public materials 2023–2025).
- BfDI (Germany), Garante (Italy) — decision reviews on AI incidents, 2023–2025.
- SINTARIS internal KB: Taris § 9 Security & data boundary, Architecture patterns § 9 Local-first default.
Sintaris audits processes and ships AI pilots for businesses in the EU and CIS. −25% discount on Audit and Pilot packages for Slovenian companies — 1 to 30 June 2026.