AI and Source Documents: How to Stop LLMs from Hallucinating
TL;DR. LLM "hallucination" is a symptom, not a diagnosis. The diagnosis is poor retrieval and/or poor citation. This article is about how to build AI answers on client documents so that every statement has a reference to a specific fragment of a specific document, and any engineer can verify the answer in 10 seconds. Approach: citation-mandatory generation + hybrid retrieval + structural ingest + a solid eval set.
1. The Conflict: "The AI Made It All Up"
The scene: an AI assistant was deployed on client documents. A week later the manager writes: "It said our regulation states X. I opened the regulation — X is not there. What do we do?"
If the team's response is "well, that's an LLM hallucination, it happens, we'll work on the prompt" — the project is dead. The client has lost trust, and trust rarely returns.
This is a solvable problem. The solution is architectural, not "ask the LLM to be more careful." The key idea: every statement in the answer must have a reference to a specific chunk of a specific document, and the reference must be automatically verifiable.
2. Who This Concerns
- Regulated niches: occupational safety, medicine, law, accounting — where "the AI was wrong" = harm.
- Internal knowledge bases: where employees must trust the system and check quickly.
- Customer support: where a wrong answer means a return / complaint / churn.
- Any project where "the AI answered" is eventually checked by a human, and the time spent on each check is multiplied across thousands of queries.
3. The Common Wrong Approach
- Take a standard RAG template: chunking → embeddings → top-K → LLM.
- Assume "the LLM is smart, it will figure out the sources itself."
- Get answers without citations. Or citations in the style of "according to document X" — without a page number.
- When a lawyer asks "show me exactly where" — there's no answer.
Additional pitfalls:
- Chunking by character count. Document structure is lost. Tables are split in half.
- Dense-only retrieval. Exact codes and identifiers are not found.
- Prompt "answer only from context" without post-validation. The LLM can ignore the instruction, and nothing catches it when it does.
- No eval set for "refusals." Nobody has tested whether the system actually refuses to answer when the source doesn't contain the answer.
4. The Engineering Approach: Structural Ingest + Citation-Mandatory Generation
The pipeline as it should be:
flowchart LR
    RAW[PDF / DOCX / HTML] --> PARSE[Structured parser<br/>page, section, table]
    PARSE --> CHUNK[Smart chunking<br/>respect structure]
    CHUNK --> META[Metadata enrichment<br/>page, section, lang]
    META --> EMB[Embeddings + FTS]
    EMB --> PG[(Postgres + pgvector)]
    QUERY[User question] --> HYBRID[Hybrid retrieval<br/>BM25 + dense + RRF]
    HYBRID --> RERANK[Cross-encoder rerank]
    RERANK --> PROMPT[Citation-mandatory prompt]
    PROMPT --> LLM[LLM]
    LLM --> VALIDATE[Citation validator]
    VALIDATE -- ok --> ANSWER[Answer with markers<br/>doc:42#p:18#§:4.2]
    VALIDATE -- fail --> RETRY[Retry with top-3 force-fed]
    RETRY --> VALIDATE
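The hybrid retrieval step in the diagram fuses a BM25 ranking and a dense ranking with Reciprocal Rank Fusion (RRF). A minimal sketch of the fusion, assuming each retriever returns an ordered list of chunk IDs; the function name, the ID format, and the sample rankings are illustrative, and k = 60 is the constant used in the RRF paper:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_hits = ["42#7", "42#8", "17#3"]    # hypothetical keyword ranking
dense_hits = ["42#8", "99#1", "42#7"]   # hypothetical embedding ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Chunks that both retrievers rank highly rise to the top, while a chunk found by only one retriever still survives into the rerank stage. RRF needs only ranks, so the BM25 and cosine scores never have to be put on a common scale.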
Key elements:
4.1. Structural Parser
Not "PDF → text." But "PDF → sequence of (page, section, char_range, text, table?, image?)." For this:
- PDF: pdfplumber or pymupdf with coordinate preservation.
- DOCX: python-docx, walking style runs.
- Complex scanned documents: OCR (Tesseract or PaddleOCR) with language detection.
Cost: ~2–4 seconds per page. Pays off on citation quality.
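One way to picture the parser output is as a list of typed records rather than a string. The sketch below is a hypothetical schema, not the format of any particular library: the class name, the field names, and the helper are assumptions that mirror the tuple described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParsedElement:
    """One structural unit emitted by the parser (hypothetical schema)."""
    doc_id: int
    page: int                                 # 1-based page number
    section: str                              # e.g. "4.2"
    char_range: tuple[int, int]               # [start, end) offsets in the page text
    text: str
    table: Optional[list[list[str]]] = None   # rows of cells when the element is a table
    image_ref: Optional[str] = None           # key of an extracted image, if any


def elements_on_page(elements: list[ParsedElement], page: int) -> list[ParsedElement]:
    """Return the elements of one page in reading order (by start offset)."""
    on_page = (e for e in elements if e.page == page)
    return sorted(on_page, key=lambda e: e.char_range[0])
```

Keeping the page, section, and character range on every element is what later lets a citation point at an exact location instead of a whole document.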
4.2. Smart Chunking
Rules we apply:
- Don't split a sentence mid-way.
- Don't split a table.
- If a section is short (< 500 tokens), the whole section = one chunk.
- If long — sliding window 400 tokens with overlap 80, reset at section boundary.
- Each chunk carries metadata: {page, section, parent_section, lang, source_uri}.
This metadata then becomes the visible part of the citation.
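The chunking rules above can be sketched for a single section as follows. This is a simplified illustration under stated assumptions: whitespace-separated words stand in for tokenizer tokens, the sentence splitter is a naive regex, and table handling is left out. Calling the function once per section gives the reset at section boundaries.

```python
import re


def count_tokens(text: str) -> int:
    # Assumption: whitespace words approximate real tokenizer tokens.
    return len(text.split())


def chunk_section(text: str, max_tokens: int = 400, overlap: int = 80,
                  short_limit: int = 500) -> list[str]:
    """Split one section into chunks without cutting a sentence."""
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) < short_limit:
        return [text]                     # short section = one chunk
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    chunks: list[str] = []
    window: list[str] = []                # sentences of the chunk being built
    for sentence in sentences:
        size = sum(count_tokens(s) for s in window)
        if window and size + count_tokens(sentence) > max_tokens:
            chunks.append(" ".join(window))
            # Carry trailing sentences (at most `overlap` tokens) forward.
            carried: list[str] = []
            for prev in reversed(window):
                kept = sum(count_tokens(s) for s in carried)
                if kept + count_tokens(prev) > overlap:
                    break
                carried.insert(0, prev)
            window = carried
        window.append(sentence)
    if window:
        chunks.append(" ".join(window))
    return chunks
```

Because the window moves in whole sentences, the overlap is "up to 80 tokens" rather than exactly 80, and a single sentence longer than the window is kept whole instead of being cut.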
4.3. Citation-Mandatory Prompt
Template (simplified):
You answer only based on the fragments below.
Every statement in the answer MUST end with a marker
in the form [doc:ID#chunk:N].
If the fragments contain no answer — respond with
"The provided sources contain no answer to this question."
Do not add facts that are not in the fragments.
Fragments:
[doc:42#chunk:7] (page 18, §4.2) ... text ...
[doc:42#chunk:8] (page 19, §4.3) ... text ...
...
Question: {question}
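Assembling this template from retrieved chunks might look like the sketch below. The `Chunk` class and `build_prompt` function are hypothetical names introduced for illustration; the instruction text follows the template above. Returning the set of allowed chunk IDs alongside the prompt keeps the later validation step tied to exactly what the model was shown.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: int
    chunk_no: int
    page: int
    section: str
    text: str

    @property
    def marker(self) -> str:
        return f"[doc:{self.doc_id}#chunk:{self.chunk_no}]"


INSTRUCTIONS = (
    "You answer only based on the fragments below.\n"
    "Every statement in the answer MUST end with a marker "
    "in the form [doc:ID#chunk:N].\n"
    "If the fragments contain no answer, respond with "
    '"The provided sources contain no answer to this question."\n'
    "Do not add facts that are not in the fragments."
)


def build_prompt(question: str, chunks: list[Chunk]) -> tuple[str, set[str]]:
    """Return the prompt text and the set of chunk IDs the answer may cite."""
    lines = [f"{c.marker} (page {c.page}, §{c.section}) {c.text}" for c in chunks]
    allowed = {f"{c.doc_id}#{c.chunk_no}" for c in chunks}
    fragments = "\n".join(lines)
    prompt = f"{INSTRUCTIONS}\n\nFragments:\n{fragments}\n\nQuestion: {question}"
    return prompt, allowed
```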
4.4. Post-Validation
After generation — simple code:
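import re

MARKER_RE = re.compile(r"\[doc:(\d+)#chunk:(\d+)\]")

def split_sentences(text: str) -> list[str]:
    # Naive placeholder splitter: break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def validate_citations(answer: str, allowed_chunk_ids: set[str]) -> bool:
    sentences = split_sentences(answer)
    if not sentences:
        return False  # an empty answer cites nothing
    for s in sentences:
        markers = MARKER_RE.findall(s)
        if not markers:
            return False  # sentence without citation
        for doc_id, chunk_id in markers:
            if f"{doc_id}#{chunk_id}" not in allowed_chunk_ids:
                return False  # citation to a chunk that was not retrieved
    return True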
If validation fails, retry the LLM call with only the top-3 chunks force-fed and a stricter prompt. If that also fails, the system responds "The provided sources contain no answer to this question." This is a feature, not a bug.
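The retry-then-refuse flow can be sketched as a small wrapper. The callables `generate(question, top_k, strict)` and `validate(answer)` are hypothetical stand-ins for the LLM call and the citation validator; the first-pass context size of 5 chunks is an assumption taken from the rerank step in the checklist.

```python
from typing import Callable

REFUSAL = "The provided sources contain no answer to this question."


def answer_with_validation(
    generate: Callable[[str, int, bool], str],
    validate: Callable[[str], bool],
    question: str,
) -> str:
    """Generate, validate, retry once with a stricter setup, else refuse."""
    # First pass: normal context size and normal prompt.
    answer = generate(question, 5, False)
    if answer.strip() == REFUSAL or validate(answer):
        return answer
    # Second pass: only the top-3 chunks and a stricter prompt.
    answer = generate(question, 3, True)
    if answer.strip() == REFUSAL or validate(answer):
        return answer
    # Both passes failed validation: refuse rather than return unverified text.
    return REFUSAL
```

The refusal string is accepted without markers on purpose: an honest "not in the sources" from the model should pass through rather than trigger a retry.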
5. Table: Criteria for "Real" Citation
| Criterion | Weak | Good | Excellent |
|---|---|---|---|
| Document identified | "per the regulation" | "per regulation X" | doc-id + name |
| Page identified | no | end of paragraph | per sentence |
| Section identified | no | chapter | chapter + paragraph |
| Citation automatically verifiable | no | regex | machine-readable + hyperlink |
| Reject when source absent | no | sometimes | always |
| Eval set for refusals | no | 5–10 questions | 30+ probe questions |
6. Sintaris Mini-Case
Worksafety Superassistant is our most demanding citation deployment. Corpus: ~3000 pages of occupational health and safety regulations (RU + EAEU + EU), plus client internal SOPs.
What was done:
- Structural parser preserving (page, section, paragraph).
- Hybrid retrieval with BM25 priority (regulations are keyword-rich).
- Citation-mandatory prompt + validator.
- Probe set of 30 "trap questions" (e.g. "What are the PPE norms for underwater welding at height?" — such a norm doesn't exist, expected answer: "not found").
- Eval set of 120 real questions with reference citations.
Metrics after 6 months:
- Citation accuracy (citation leads to chunk that actually supports the statement): 94%.
- Refusal correctness (on probe set — correct refusals): 96%.
- Median latency: 2.1 sec (including rerank).
- Complaints from safety officers per quarter: 0.
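The first two metrics are simple ratios over judged records. A minimal sketch, assuming each cited statement in the eval answers has been judged by a reviewer and each probe question has been marked as refused or not; the record classes and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CitationJudgement:
    """One cited statement from an eval answer, judged by a reviewer."""
    question_id: str
    supported: bool    # True if the cited chunk really supports the statement


@dataclass
class ProbeResult:
    """One trap question whose correct outcome is a refusal."""
    question_id: str
    refused: bool


def citation_accuracy(judgements: list[CitationJudgement]) -> float:
    """Share of citations whose chunk supports the statement."""
    if not judgements:
        raise ValueError("no judgements to score")
    return sum(j.supported for j in judgements) / len(judgements)


def refusal_correctness(results: list[ProbeResult]) -> float:
    """Share of probe questions on which the system refused."""
    if not results:
        raise ValueError("no probe results to score")
    return sum(r.refused for r in results) / len(results)
```

Keeping the two numbers separate matters: a system can score well on citation accuracy simply by answering rarely, which only the refusal metric on answerable and unanswerable questions will expose.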
Details: Worksafety § 6 RAG pipeline.
7. Checklist (15 Points) for Citation-Mandatory RAG
- Parser preserves structure (page, section, paragraph) — not "whole PDF as one blob."
- Chunking respects structure — doesn't split sentences and tables.
- Metadata (page, section) is saved in every chunk.
- Hybrid retrieval (BM25 + dense + RRF) — not embeddings only.
- Cross-encoder rerank — top-12 → top-5.
- Prompt template requires citation markers.
- Post-validation checks that every sentence carries a marker and that each marker points to a retrieved chunk.
- Refusal pattern — "not in sources" as a first-class behaviour.
- Eval set — minimum 30 questions with reference citations.
- Probe set for refusals — minimum 20 questions that cannot be answered.
- Citation accuracy is measured — shown in Grafana, not "manually sometimes."
- Refusal rate is measured — separately from accuracy.
- Drift monitor — checks weekly that metrics haven't dropped.
- Retrieval threshold — if top-1 score < X, refuse.
- Hyperlinks in answers — user clicks and sees the source document.
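For the last checklist point, citation markers can be turned into clickable links at render time. The sketch below is one possible approach, with several assumptions: the `chunk_meta` mapping, the `base_url`, and the markdown link output are illustrative, and the `#page=N` fragment relies on the PDF viewer honouring that open parameter.

```python
import re
from urllib.parse import quote

MARKER_RE = re.compile(r"\[doc:(\d+)#chunk:(\d+)\]")


def link_citations(answer: str, chunk_meta: dict[str, dict], base_url: str) -> str:
    """Replace [doc:ID#chunk:N] markers with markdown links to the source page.

    `chunk_meta` maps "ID#N" to {"source_uri": ..., "page": ..., "section": ...}.
    Markers without metadata are left untouched so they stay visible as errors.
    """
    def render(match: re.Match) -> str:
        key = f"{match.group(1)}#{match.group(2)}"
        meta = chunk_meta.get(key)
        if meta is None:
            return match.group(0)
        label = f"doc {match.group(1)}, p. {meta['page']}, §{meta['section']}"
        url = f"{base_url}/{quote(meta['source_uri'])}#page={meta['page']}"
        return f"[{label}]({url})"

    return MARKER_RE.sub(render, answer)
```

The model only ever sees and emits the short numeric marker; the human-readable label and URL come from chunk metadata, so the model has no opportunity to invent a page number.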
8. Risks
- Latency. Cross-encoder rerank + validation add 200–400 ms. Fine for business scenarios, bad for real-time chat. Solution: cache frequent queries.
- "Too cautious." If the threshold is too high — the system often says "I don't know." Fixed by calibrating on the eval set.
- Structural parsing of complex PDFs. Scanned tables are still painful. Budget time for OCR review.
- Multi-language citations. If regulations are in Russian but answers in German — citation labels must be language-stable. Better to use numeric IDs rather than "§4.2."
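Calibrating the refusal threshold on the eval set, as suggested for the "too cautious" risk, can be sketched as a search over observed scores. This assumes top-1 retrieval scores have been logged for answerable eval questions and for unanswerable probe questions; the function name and the equal weighting of both error types are assumptions, and the sample scores are invented for the example.

```python
def calibrate_threshold(answerable: list[float], unanswerable: list[float]) -> float:
    """Pick the top-1 score threshold that classifies most questions correctly.

    A question is refused when its top-1 score is below the threshold.
    Ties are resolved in favour of the lowest (least cautious) threshold.
    """
    if not answerable or not unanswerable:
        raise ValueError("need scores from both the eval set and the probe set")
    candidates = sorted(set(answerable) | set(unanswerable))
    best_threshold, best_correct = candidates[0], -1
    for t in candidates:
        answered_ok = sum(s >= t for s in answerable)    # answered, correctly
        refused_ok = sum(s < t for s in unanswerable)    # refused, correctly
        if answered_ok + refused_ok > best_correct:
            best_threshold, best_correct = t, answered_ok + refused_ok
    return best_threshold


eval_scores = [0.82, 0.74, 0.91, 0.55]    # invented scores, answerable questions
probe_scores = [0.31, 0.48, 0.60]         # invented scores, trap questions
threshold = calibrate_threshold(eval_scores, probe_scores)
```

In a regulated setting a wrong answer usually costs more than an unnecessary refusal, so weighting `refused_ok` more heavily than `answered_ok` is a reasonable variation.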
9. What to Do Next
If you already have a RAG pilot that occasionally "lies" — start with two actions:
- Enable citation-mandatory prompt + validator.
- Build a probe set of 20 questions that have no answer in the corpus, and measure refusal correctness.
Often these two steps are enough to lift quality from "sometimes right" to "always either right or honestly refuses."
If you want to go through the full cycle — AI Pilot over 4–8 weeks includes building citation-mandatory RAG as a standard pattern. For Slovenian companies, −25% from 1 to 30 June 2026.
10. References
- Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401.
- Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. arXiv:2004.04906.
- Cormack, G., Clarke, C., Buettcher, S. (2009). Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods. SIGIR '09.
- Liu, N. F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.
- Es, S. et al. (2024). RAGAS: Automated Evaluation of Retrieval-Augmented Generation. arXiv:2309.15217.
- pdfplumber — https://github.com/jsvine/pdfplumber
- PaddleOCR — https://github.com/PaddlePaddle/PaddleOCR
Sintaris builds RAG systems with verifiable citations for businesses in the EU and CIS. Discovery call — free, 30 minutes.