AI and Source Documents: How to Stop LLMs from Hallucinating

2026-06-28 · Sintaris · rag, citations, hallucinations, grounding, document-ai, ocr

TL;DR. LLM "hallucination" is a symptom, not a diagnosis. The diagnosis is poor retrieval and/or poor citation. This article is about how to build AI answers on client documents so that every statement has a reference to a specific fragment of a specific document, and any engineer can verify the answer in 10 seconds. Approach: citation-mandatory generation + hybrid retrieval + structural ingest + a solid eval set.

1. The Conflict: "The AI Made It All Up"

The scene: an AI assistant was deployed on client documents. A week later the manager writes: "It said our regulation states X. I opened the regulation — X is not there. What do we do?"

If the team's response is "well, that's an LLM hallucination, it happens, we'll work on the prompt" — the project is dead. The client has lost trust, and trust rarely returns.

This is a solvable problem. The solution is architectural, not "ask the LLM to be more careful." The key idea: every statement in the answer must have a reference to a specific chunk of a specific document, and the reference must be automatically verifiable.

2. Who This Concerns

3. The Common Wrong Approach

  1. Take a standard RAG template: chunking → embeddings → top-K → LLM.
  2. Assume "the LLM is smart, it will figure out the sources itself."
  3. Get answers without citations, or with citations in the style of "according to document X" and no page number.
  4. When a lawyer asks "show me exactly where" — there's no answer.

Additional pitfalls:

4. The Engineering Approach: Structural Ingest + Citation-Mandatory Generation

The pipeline as it should be:

flowchart LR
  RAW[PDF / DOCX / HTML] --> PARSE[Structured parser<br/>page, section, table]
  PARSE --> CHUNK[Smart chunking<br/>respect structure]
  CHUNK --> META[Metadata enrichment<br/>page, section, lang]
  META --> EMB[Embeddings + FTS]
  EMB --> PG[(Postgres + pgvector)]
  QUERY[User question] --> HYBRID[Hybrid retrieval<br/>BM25 + dense + RRF]
  HYBRID --> RERANK[Cross-encoder rerank]
  RERANK --> PROMPT[Citation-mandatory prompt]
  PROMPT --> LLM[LLM]
  LLM --> VALIDATE[Citation validator]
  VALIDATE -- ok --> ANSWER[Answer with markers<br/>doc:42#p:18#§:4.2]
  VALIDATE -- fail --> RETRY[Retry with top-3 force-fed]
  RETRY --> VALIDATE
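
The hybrid retrieval node in the diagram fuses a BM25 ranking and a dense ranking with RRF (Reciprocal Rank Fusion). A minimal sketch of that fusion step, assuming each retriever returns an ordered list of chunk ids; the function name, the `"doc#chunk"` id format, and the constant `k = 60` (a commonly used default) are illustrative assumptions, not values taken from the pipeline described here:

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 12) -> list[str]:
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every chunk it returns.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk id for determinism.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


# Hypothetical results from the two retrievers.
bm25_hits = ["42#7", "42#8", "17#3"]
dense_hits = ["42#8", "99#1", "42#7"]
fused = rrf_fuse([bm25_hits, dense_hits])
```

Chunks that both retrievers rank highly rise to the top, while a chunk found by only one retriever still survives into the rerank stage.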

Key elements:

4.1. Structural Parser

The goal is not "PDF → text" but "PDF → sequence of (page, section, char_range, text, table?, image?)." For this:

Cost: ~2–4 seconds per page, which pays off in citation quality.
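
One way to picture the parser output is as a typed record per element. The sketch below mirrors the tuple (page, section, char_range, text, table?, image?); the class name, the field types, and the `doc_id` and `kind` members are assumptions added for illustration:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParsedElement:
    """One structural unit emitted by the parser (hypothetical schema)."""
    doc_id: int
    page: int
    section: str
    char_range: tuple[int, int]              # (start, end) offsets within the page
    text: str
    table: Optional[list[list[str]]] = None  # rows of cells if the element is a table
    image_ref: Optional[str] = None          # storage key if the element is an image

    @property
    def kind(self) -> str:
        if self.table is not None:
            return "table"
        if self.image_ref is not None:
            return "image"
        return "text"


element = ParsedElement(doc_id=42, page=18, section="4.2",
                        char_range=(1040, 1310), text="...")
```

Keeping `page`, `section`, and `char_range` on every element is what later lets a citation point at an exact place rather than at a whole file.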

4.2. Smart Chunking

Rules we apply:

This metadata then becomes the visible part of the citation.
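
As a rough illustration of how chunk metadata turns into a citation, the sketch below renders both the machine-readable marker used in the prompt and the reader-facing label shown in the pipeline diagram. The `Chunk` class and both function names are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A chunk with the metadata carried over from the parser (assumed fields)."""
    doc_id: int
    chunk_no: int
    page: int
    section: str
    text: str


def machine_marker(chunk: Chunk) -> str:
    # Marker the LLM is asked to emit, e.g. [doc:42#chunk:7].
    return f"[doc:{chunk.doc_id}#chunk:{chunk.chunk_no}]"


def visible_citation(chunk: Chunk) -> str:
    # Label shown to the reader, e.g. doc:42#p:18#§:4.2.
    return f"doc:{chunk.doc_id}#p:{chunk.page}#§:{chunk.section}"


chunk = Chunk(doc_id=42, chunk_no=7, page=18, section="4.2", text="...")
marker = machine_marker(chunk)
label = visible_citation(chunk)
```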

4.3. Citation-Mandatory Prompt

Template (simplified):

You answer only based on the fragments below.
Every statement in the answer MUST end with a marker
in the form [doc:ID#chunk:N].
If the fragments contain no answer — respond with
"The provided sources contain no answer to this question."
Do not add facts that are not in the fragments.

Fragments:
[doc:42#chunk:7] (page 18, §4.2) ... text ...
[doc:42#chunk:8] (page 19, §4.3) ... text ...
...

Question: {question}
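
A minimal sketch of filling this template from retrieved fragments. The dictionary keys (`doc_id`, `chunk_no`, `page`, `section`, `text`) and the function name are assumptions; the function also returns the set of allowed ids so that a validator can later check the markers against exactly what the model was shown:

```python
PROMPT_HEADER = (
    "You answer only based on the fragments below.\n"
    "Every statement in the answer MUST end with a marker\n"
    "in the form [doc:ID#chunk:N].\n"
    "If the fragments contain no answer, respond with\n"
    '"The provided sources contain no answer to this question."\n'
    "Do not add facts that are not in the fragments."
)


def build_prompt(question: str, fragments: list[dict]) -> tuple[str, set[str]]:
    """Return the filled prompt and the ids of the chunks it contains."""
    lines = []
    allowed_ids = set()
    for f in fragments:
        marker = f"[doc:{f['doc_id']}#chunk:{f['chunk_no']}]"
        lines.append(f"{marker} (page {f['page']}, §{f['section']}) {f['text']}")
        allowed_ids.add(f"{f['doc_id']}#{f['chunk_no']}")
    prompt = (
        f"{PROMPT_HEADER}\n\nFragments:\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}"
    )
    return prompt, allowed_ids


# Hypothetical fragment, shaped like the example in the template.
prompt, allowed = build_prompt(
    "Q?",
    [{"doc_id": 42, "chunk_no": 7, "page": 18, "section": "4.2", "text": "T"}],
)
```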

4.4. Post-Validation

After generation — simple code:

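import re

MARKER = re.compile(r"\[doc:(\d+)#chunk:(\d+)\]")

def split_sentences(text: str) -> list[str]:
    # Naive stand-in splitter: break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def validate_citations(answer: str, allowed_chunk_ids: set[str]) -> bool:
    """Return True only if every sentence cites at least one allowed chunk."""
    sentences = split_sentences(answer)
    for s in sentences:
        markers = MARKER.findall(s)
        if not markers:
            return False  # sentence without a citation
        for doc_id, chunk_id in markers:
            # Allowed ids use the "doc_id#chunk_id" form, e.g. "42#7".
            if f"{doc_id}#{chunk_id}" not in allowed_chunk_ids:
                return False  # citation to a chunk that was not retrieved
    return bool(sentences)  # an empty answer is not valid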

If validation fails, retry the LLM call with the top-3 chunks force-fed and a stricter prompt. If that also fails, the system responds "The provided sources contain no answer to this question." This is a feature, not a bug.
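
The retry logic can be sketched as a small control loop. Everything here is an assumption made for illustration: `generate` and `validate` are hypothetical callables standing in for the LLM call and the citation validator, "force-fed" is read as "restrict the context to the top-3 chunks", and an exact refusal string is let through without markers because it makes no claims to cite:

```python
from typing import Callable, Sequence

REFUSAL = "The provided sources contain no answer to this question."


def answer_with_retry(
    question: str,
    ranked_chunks: Sequence[str],
    generate: Callable[[str, Sequence[str], bool], str],  # (question, chunks, strict) -> answer
    validate: Callable[[str, Sequence[str]], bool],       # (answer, chunks) -> ok?
) -> str:
    """One normal attempt, one strict retry on the top-3 chunks, then refuse."""
    draft = generate(question, ranked_chunks, False)
    if draft == REFUSAL or validate(draft, ranked_chunks):
        return draft
    top3 = ranked_chunks[:3]
    retry = generate(question, top3, True)
    if retry == REFUSAL or validate(retry, top3):
        return retry
    return REFUSAL
```

Capping the loop at a single retry keeps latency predictable and makes the refusal path an explicit outcome rather than an error case.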

5. Table: Criteria for "Real" Citation

| Criterion | Weak | Good | Excellent |
|---|---|---|---|
| Document identified | "per the regulation" | "per regulation X" | doc-id + name |
| Page identified | no | end of paragraph | per sentence |
| Section identified | no | chapter | chapter + paragraph |
| Citation automatically verifiable | no | regex | machine-readable + hyperlink |
| Reject when source absent | no | sometimes | always |
| Eval set for refusals | no | 5–10 questions | 30+ probe questions |

6. Sintaris Mini-Case

Worksafety Superassistant is our most demanding citation deployment. Corpus: ~3000 pages of occupational health and safety regulations (RU + EAEU + EU), plus client internal SOPs.

What was done:

Metrics after 6 months:

Details: Worksafety § 6 RAG pipeline.

7. Checklist (15 Points) for Citation-Mandatory RAG

  1. Parser preserves structure (page, section, paragraph) — not "whole PDF as one blob."
  2. Chunking respects structure — doesn't split sentences and tables.
  3. Metadata (page, section) is saved in every chunk.
  4. Hybrid retrieval (BM25 + dense + RRF) — not embeddings only.
  5. Cross-encoder rerank — top-12 → top-5.
  6. Prompt template requires citation markers.
  7. Post-validation checks that every sentence carries a marker and that each marker points to a retrieved chunk.
  8. Refusal pattern — "not in sources" as a first-class behaviour.
  9. Eval set — minimum 30 questions with reference citations.
  10. Probe set for refusals — minimum 20 questions that cannot be answered.
  11. Citation accuracy is measured — shown in Grafana, not "manually sometimes."
  12. Refusal rate is measured — separately from accuracy.
  13. Drift monitor — checks weekly that metrics haven't dropped.
  14. Retrieval threshold — if top-1 score < X, refuse.
  15. Hyperlinks in answers — user clicks and sees the source document.
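
Measuring citation accuracy against reference citations (items 9 and 11) can start as a comparison of cited chunk ids with the expected ones. This is a sketch under assumptions: the precision/recall definition, the function names, and the `"doc#chunk"` id format are illustrative choices, not a metric prescribed by this checklist:

```python
import re

MARKER = re.compile(r"\[doc:(\d+)#chunk:(\d+)\]")


def cited_ids(answer: str) -> set[str]:
    """Collect every cited chunk as a 'doc#chunk' id."""
    return {f"{doc}#{chunk}" for doc, chunk in MARKER.findall(answer)}


def citation_precision_recall(answer: str, reference_ids: set[str]) -> tuple[float, float]:
    """Precision: share of cited chunks that are correct.
    Recall: share of reference chunks that were cited."""
    cited = cited_ids(answer)
    if not cited or not reference_ids:
        return 0.0, 0.0
    hits = len(cited & reference_ids)
    return hits / len(cited), hits / len(reference_ids)


# Hypothetical eval item: the reference says chunks 7 and 8 support the answer.
answer = "A holds [doc:42#chunk:7]. B holds [doc:42#chunk:9]."
precision, recall = citation_precision_recall(answer, {"42#7", "42#8"})
```

Averaging both numbers over the eval set gives a value that can be exported to a dashboard and tracked by the drift monitor.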

8. Risks

9. What to Do Next

If you already have a RAG pilot that occasionally "lies" — start with two actions:

  1. Enable citation-mandatory prompt + validator.
  2. Build a probe set of 20 questions that have no answer in the corpus, and measure refusal correctness.

Often these two steps are enough to lift quality from "sometimes right" to "always either right or honestly refuses."
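
Refusal correctness on the probe set reduces to counting how many unanswerable questions received the exact refusal message. A minimal sketch, assuming the answers have already been collected from the assistant; the function name and the exact-match rule are assumptions:

```python
REFUSAL = "The provided sources contain no answer to this question."


def refusal_correctness(probe_answers: list[str]) -> float:
    """Share of probe questions (none answerable from the corpus) that were refused."""
    if not probe_answers:
        raise ValueError("probe set is empty")
    refused = sum(1 for a in probe_answers if a.strip() == REFUSAL)
    return refused / len(probe_answers)


# Hypothetical run on four probe questions: three refusals, one invented answer.
score = refusal_correctness([
    REFUSAL,
    REFUSAL,
    "The limit is 40 hours [doc:42#chunk:7].",
    REFUSAL + "\n",
])
```

Any probe answer that is not a refusal is worth reading by hand, since it shows exactly which chunk the model stretched into an answer.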

If you want to go through the full cycle, the AI Pilot (4–8 weeks) includes building citation-mandatory RAG as a standard pattern. For Slovenian companies, there is a 25% discount from 1 to 30 June 2026.


Sintaris builds RAG systems with verifiable citations for businesses in the EU and CIS. Discovery call — free, 30 minutes.