Introduction to RAG

Build answers from evidence, not memory.

A practical walkthrough of a .NET book-club RAG pipeline with Aspire, Qdrant, MinIO, Gemini or Ollama, and citation-backed answers.


[Figure: a retrieval-augmented generation pipeline, from documents to citations.]

What is RAG?

Retrieval-augmented generation connects a language model to your own evidence.

Large language models are excellent at language, but they do not automatically know your private documents, your latest operational data, or the exact passages a user needs to trust an answer. RAG adds a retrieval step before generation: store source material, search for relevant chunks, pass those chunks to the model, and return an answer with citations.
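The retrieval step can be pictured with a small in-memory sketch. This is an illustration under stated assumptions, not the project's code: `Chunk`, `RagSketch`, and their members are hypothetical names, vectors are assumed to be precomputed, and a real pipeline would delegate the search to a vector database rather than sorting a list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical chunk shape: the text plus enough metadata to cite it later.
public sealed record Chunk(string DocumentId, int Index, string Text, float[] Vector);

public static class RagSketch
{
    // Cosine similarity between two vectors of equal length.
    public static double Cosine(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return normA == 0 || normB == 0 ? 0 : dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Search: rank stored chunks against the question vector and keep the top k.
    public static IReadOnlyList<Chunk> Retrieve(IEnumerable<Chunk> store, float[] question, int k) =>
        store.OrderByDescending(c => Cosine(c.Vector, question)).Take(k).ToList();

    // Pass chunks to the model: number each one so the answer can cite [1], [2], ...
    public static string BuildPrompt(string question, IReadOnlyList<Chunk> evidence)
    {
        var numbered = evidence.Select((c, i) => $"[{i + 1}] ({c.DocumentId}#{c.Index}) {c.Text}");
        return "Answer using only the evidence below and cite it by number.\n\n"
             + string.Join("\n", numbered)
             + $"\n\nQuestion: {question}";
    }
}
```

The numbered evidence list is what makes citations possible: the model can only point at passages the retrieval layer chose to include.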

That makes RAG the natural fit whenever an answer should be grounded in a changing or private corpus: policies, tickets, books, customer records, manuals, research notes, or internal knowledge bases. The model writes the response, but the retrieval layer decides what evidence it is allowed to use.

In practice, model selection is not the only hard part. A professional RAG system must make retrieval inspectable, distinguish primary source evidence from generated support material, evaluate answer behavior against known questions, and put guardrails around cost, latency, and user input.

Upload → Extract → Chunk → Embed → Retrieve → Answer → Evaluate

Series Overview

This guide explains the current sample project as a source-code-backed series. It is written for engineers who already know basic C# and ASP.NET Core, but are still learning how modern RAG systems are assembled and evaluated.

The goal is not to present a perfect production architecture. The goal is to show how the pieces connect, where the boundaries are, and why those boundaries matter when building a document-ingestion and question-answering system in .NET.

Project workflow

The project implements this workflow:

1. Upload PDF/TXT
2. Store original file in object storage
3. Track metadata in SQLite
4. Worker extracts text
5. Generate book-club literary artifacts
6. Chunk source text and artifacts
7. Generate embeddings
8. Store vectors and citation payloads in Qdrant
9. Retrieve relevant chunks for a question
10. Send chunks to an LLM
11. Return answer + citations
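Of the steps above, chunking (step 6) is the easiest to picture in code. The workflow list does not say which chunking strategy the project uses, so the sketch below assumes the simplest common one: fixed-size character windows with overlap. `ChunkerSketch` and its parameters are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class ChunkerSketch
{
    // Splits text into windows of at most `size` characters, where each window
    // repeats the last `overlap` characters of the previous one.
    public static List<string> Chunk(string text, int size, int overlap)
    {
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
        if (overlap < 0 || overlap >= size) throw new ArgumentOutOfRangeException(nameof(overlap));

        var chunks = new List<string>();
        var step = size - overlap; // how far the window advances each time
        for (var start = 0; start < text.Length; start += step)
        {
            var length = Math.Min(size, text.Length - start);
            chunks.Add(text.Substring(start, length));
            if (start + length >= text.Length) break; // the window reached the end
        }
        return chunks;
    }
}
```

For example, `ChunkerSketch.Chunk("abcdefghij", 4, 1)` returns `"abcd"`, `"defg"`, and `"ghij"`. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of embedding some text twice.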

At a high level, the system has six responsibilities:

Orchestration: Aspire starts the API, worker, Qdrant, MinIO, and optionally Ollama.
User interaction: The API hosts the upload and chat UI.
Durable state: SQLite tracks document status; MinIO stores originals; Qdrant stores vectors.
Ingestion: The worker converts files into searchable chunks.
Answering: The ask service retrieves evidence and asks an LLM to answer from that evidence.
Evaluation and operations: Tests, diagnostics, provenance, request limits, logging, and delete/reindex controls make the sample inspectable instead of opaque.

The most important design choice is that the API and worker do not know model-specific request formats. They depend on interfaces such as IEmbeddingProvider, IChatCompletionProvider, and IVectorStore. The same idea now applies inside retrieval: token estimation, reranking, ingestion work discovery, document management, diagnostics, and evaluation all have explicit seams so the sample can teach the engineering decisions behind RAG, not just the happy-path flow.
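A sketch of what such seams might look like follows. Only the three interface names come from the guide; every member signature, the `VectorHit` record, the `AskSketch` caller, and the fake implementations are assumptions made for illustration and may differ from the project's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Interface names come from the guide; every member signature here is an assumption.
public interface IEmbeddingProvider
{
    Task<float[]> EmbedAsync(string text, CancellationToken ct = default);
}

public interface IChatCompletionProvider
{
    Task<string> CompleteAsync(string prompt, CancellationToken ct = default);
}

public sealed record VectorHit(string ChunkId, string Text);

public interface IVectorStore
{
    Task<IReadOnlyList<VectorHit>> SearchAsync(float[] query, int limit, CancellationToken ct = default);
}

// Hypothetical caller: it composes the three seams and never sees a provider's wire format.
public sealed class AskSketch
{
    private readonly IEmbeddingProvider _embeddings;
    private readonly IVectorStore _store;
    private readonly IChatCompletionProvider _chat;

    public AskSketch(IEmbeddingProvider embeddings, IVectorStore store, IChatCompletionProvider chat)
    {
        _embeddings = embeddings;
        _store = store;
        _chat = chat;
    }

    public async Task<string> AskAsync(string question, CancellationToken ct = default)
    {
        var vector = await _embeddings.EmbedAsync(question, ct);
        var hits = await _store.SearchAsync(vector, 3, ct);
        var evidence = string.Join("\n", hits.Select(h => $"[{h.ChunkId}] {h.Text}"));
        return await _chat.CompleteAsync($"Evidence:\n{evidence}\n\nQuestion: {question}", ct);
    }
}

// Test doubles standing in for a real embedding model, vector database, and chat model.
public sealed class FakeEmbeddings : IEmbeddingProvider
{
    public Task<float[]> EmbedAsync(string text, CancellationToken ct = default) =>
        Task.FromResult(new float[] { text.Length });
}

public sealed class FakeStore : IVectorStore
{
    public Task<IReadOnlyList<VectorHit>> SearchAsync(float[] query, int limit, CancellationToken ct = default) =>
        Task.FromResult<IReadOnlyList<VectorHit>>(new[] { new VectorHit("chunk-1", "Chapter one text") });
}

public sealed class EchoChat : IChatCompletionProvider
{
    // Returns the prompt unchanged so a test can inspect exactly what the model would see.
    public Task<string> CompleteAsync(string prompt, CancellationToken ct = default) =>
        Task.FromResult(prompt);
}
```

Because `AskSketch` depends only on the interfaces, swapping Gemini for Ollama, or a real vector store for `FakeStore`, becomes a registration change rather than a rewrite, and the fakes make the ask path testable without network calls.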

This guide is now maintained as the narrative source of truth for the RAGPipeline learning project. It tracks the current source code directly, including retrieval diagnostics, generated-artifact provenance, citation labeling, request guardrails, evaluation tests, and operational seams.

What this guide teaches

The engineering habits behind credible RAG systems.

The technical chapters walk through the implementation, but the project is also meant to show how experienced engineers think about RAG: preserve source material, make derived artifacts explicit, inspect retrieval, evaluate behavior, and keep operational limits visible.

Store original documents separately from vectors.
Track ingestion as durable state.
Keep long-running ingestion outside request/response paths.
Use provider-neutral abstractions for AI services.
Embed generated metadata, not only raw source text, and preserve provenance for it.
Tune retrieval based on expected question types.
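Preserving provenance for generated metadata can be as simple as tagging every stored chunk with its origin. The sketch below is one hypothetical shape for that tag; `ChunkKind`, `ChunkPayload`, `CitationLabels`, and the label wording are assumptions, not the project's actual payload schema.

```csharp
using System;

// Assumed distinction: text copied from the document vs. text a model generated about it.
public enum ChunkKind { SourceText, GeneratedArtifact }

// Hypothetical payload stored next to each vector. GeneratedBy is empty for source text.
public sealed record ChunkPayload(string DocumentId, ChunkKind Kind, string Text, string GeneratedBy = "");

public static class CitationLabels
{
    // Produces a label that tells the reader whether a citation is primary evidence.
    public static string Label(ChunkPayload payload) => payload.Kind switch
    {
        ChunkKind.SourceText => $"Source: {payload.DocumentId}",
        ChunkKind.GeneratedArtifact =>
            $"Generated aid ({(string.IsNullOrEmpty(payload.GeneratedBy) ? "unknown model" : payload.GeneratedBy)}): {payload.DocumentId}",
        _ => throw new ArgumentOutOfRangeException(nameof(payload))
    };
}
```

Carrying the kind in the payload means the distinction survives retrieval, so the UI can show a reader which citations quote the book and which point at model-written summaries.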

Combine vector search with structured retrieval and simple exact-name fallback.
Make retrieval inspectable with diagnostics and rank reasons.
Return citations that distinguish source chunks from generated retrieval aids.
Surface ingestion failures and progress to the UI.
Use guardrails for question size, selected documents, retrieval expansion, and provider timeouts.
Test RAG behavior with deterministic golden-question evaluations.
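A deterministic golden-question evaluation can be sketched as a hit-rate check over known questions. This is a minimal illustration, assuming each question has one chunk that must appear in the top results; `GoldenQuestion`, `GoldenEval`, and the `retrieve` delegate are hypothetical names, and the project's real evaluation tests may measure more than this.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A known question paired with the chunk id that a correct retrieval must surface.
public sealed record GoldenQuestion(string Question, string ExpectedChunkId);

public static class GoldenEval
{
    // Returns the fraction of questions whose expected chunk appears in the top-k results.
    // `retrieve` takes (question, k) and returns ranked chunk ids; pass a deterministic
    // implementation (for example, fake embeddings) to keep the test repeatable.
    public static double HitRate(
        IEnumerable<GoldenQuestion> questions,
        Func<string, int, IReadOnlyList<string>> retrieve,
        int k)
    {
        var total = 0;
        var hits = 0;
        foreach (var q in questions)
        {
            total++;
            if (retrieve(q.Question, k).Contains(q.ExpectedChunkId)) hits++;
        }
        return total == 0 ? 0 : (double)hits / total;
    }
}
```

A test can then assert that the hit rate stays above a chosen threshold, which turns a retrieval regression into a failing build instead of a vague impression that answers got worse.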

Production hardening

What still needs more rigor before real deployment.

This sample is a teaching project with production-shaped seams. It is useful for learning architecture, retrieval behavior, and evaluation habits, but it is not a secure or scalable deployment template by itself.

Replace ad hoc SQLite schema updates with migrations.
Add authentication, authorization, auditing, and retention policy.
Support cloud object storage directly.
Add provider implementations for Azure OpenAI, Bedrock, Vertex AI, or OpenAI.

Add deeper observability around token usage, latency, provider errors, retrieval quality, and evaluation drift.
Improve PDF extraction quality.
Replace the database work source with queue infrastructure for multi-worker deployments.
Add provider-compatible tokenization, optional model-based reranking, and citation faithfulness checks.