Interactive · ~18 min · Intermediate

RAG — Retrieval Augmented Generation

Learn how to augment language models with external knowledge. RAG lets LLMs access current information, domain expertise, and proprietary data at inference time, so their answers are no longer limited to what was in the training set. This lesson covers chunking, embeddings, vector search, and the complete pipeline.

Why RAG?

Large language models are powerful, but they have three fundamental limitations. First, their training data has a cutoff date—they know nothing about recent events. Second, they have no access to proprietary information (your company's documents, product specs, etc.). Third, they can hallucinate or confabulate facts when they're uncertain.

Retrieval Augmented Generation addresses all three by letting the model retrieve relevant documents from a knowledge base at inference time. Instead of relying only on weights learned during training, the model can draw on external sources placed in its prompt. This makes responses more factual, up-to-date, and grounded, though only as reliable as the documents that are retrieved.

When to use RAG

  • Answer questions about live/changing data
  • Incorporate company-specific knowledge
  • Reduce hallucination and improve accuracy
  • Audit the model's source information

Quick check

What is the primary advantage of RAG over standard LLMs?

The RAG Pipeline

RAG is a six-step process: query, embed, search, retrieve, augment, and generate. All six steps run at inference time; chunking and embedding the documents themselves happens beforehand, when the knowledge base is indexed.

End-to-End RAG Pipeline

How a query flows through the retrieval augmented generation process:

  • Query: User asks a question
  • Embed: Convert query to a vector
  • Search: Find similar chunks
  • Retrieve: Get top-k documents
  • Augment: Add context to prompt
  • Generate: LLM produces answer

Example Query

"What are the key benefits of RAG systems?"

Key stages

  • Embed: Convert query to vector using same embedding model as docs
  • Search: Find nearest neighbors in vector space (similarity search)
  • Retrieve: Return top-k chunks from knowledge base
  • Augment: Insert chunks into prompt as context
  • Generate: LLM generates answer based on context + query
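
The stages above can be sketched end to end in a few lines. This is a toy illustration, not a production design: the `embed` function is a bag-of-words stand-in for a real embedding model, the three sample chunks are made up, and the generate step is left out because it depends on whichever LLM API you use.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    query_vec = embed(query)  # Embed: same function as used for the chunks
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)  # Search
    return ranked[:k]  # Retrieve: keep the top-k chunks

def build_prompt(query: str, context: list[str]) -> str:
    """Augment: insert the retrieved chunks into the prompt as context."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only the context below.\n\nContext:\n{joined}\n\nQuestion: {query}"

# Hypothetical knowledge base, for illustration only.
CHUNKS = [
    "RAG retrieves relevant documents at inference time.",
    "Bananas contain potassium.",
    "RAG systems ground answers in retrieved context.",
]
QUESTION = "What are the key benefits of RAG systems?"

top = retrieve(QUESTION, CHUNKS)
prompt = build_prompt(QUESTION, top)
print(prompt)  # Generate: this prompt would be sent to an LLM
```

In a real system the chunk vectors would be computed once at indexing time and stored, rather than re-embedded on every query as this sketch does.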

Trade-offs

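RAG quality depends on retrieval quality. If the top-k chunks don't contain the answer, the LLM has nothing to ground its response in: it may fall back on what it learned in training, hallucinate, or, if told to use only the provided context, fail to answer even though it "knew" the answer.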

Better retrieval = better answers. This is why chunking strategy and embedding model matter.


Quick check

Why does RAG quality depend primarily on retrieval, not the LLM?

Document Chunking

Before retrieval, documents must be split into chunks. The right chunk size balances context preservation with retrieval precision.

Document Chunking Strategy

Example settings: chunk size 80 characters, overlap 20 characters shared between consecutive chunks (25% of chunk size).
Result: 15 chunks, averaging 12 words per chunk. Chunk previews are truncated.

Chunk 1 [0-80]: Retrieval augmented generation is a technique that combines …
Chunk 2 [60-140]: the strengths of large language models with external knowled…
Chunk 3 [120-200]: ge sources. Instead of relying solely on the knowledge encod…
Chunk 4 [180-260]: ed in model weights during training, RAG systems retrieve re…
Chunk 5 [240-320]: levant documents or passages from a knowledge base and use t…
Chunk 6 [300-380]: hem to augment the LLM's input prompt. This allows models to…
Chunk 7 [360-440]: provide more accurate, up-to-date, and fact-grounded respon…
Chunk 8 [420-500]: ses. The RAG pipeline consists of three main stages: first, …
Chunk 9 [480-560]: the user query is embedded into a vector representation. Sec…
Chunk 10 [540-620]: ond, similar documents are retrieved from the knowledge base…
Chunk 11 [600-680]: using vector similarity search. Third, the retrieved docume…
Chunk 12 [660-740]: nts are concatenated with the original query and fed to the …
Chunk 13 [720-800]: language model, which generates the final response based on …
Chunk 14 [780-860]: this enriched context. This approach has proven effective fo…
Chunk 15 [840-914]: r question answering, fact verification, and domain-specific…

Chunk size guide

  • Too small (50 tokens): Loses context, fragmented meaning, high overlap needed
  • Just right (256-512 tokens): Fits context window, captures complete ideas, good relevance
  • Too large (1000+ tokens): Adds irrelevant context, may exceed context limits, noisier

Overlap strategy

Overlap between chunks (typically 20% of chunk size) prevents important information from being split across chunk boundaries.

Example with 100-char chunks, 20-char overlap:
Chunk 1: [0-100]
Chunk 2: [80-180]
Chunk 3: [160-260]
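
The overlap example above can be reproduced with a small character-based splitter. This is a minimal sketch: the function name `chunk_text` and its return shape are illustrative choices, and real pipelines often split on tokens or sentence boundaries rather than raw characters.

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[tuple[int, int, str]]:
    """Split text into fixed-size character chunks that share `overlap` characters.

    Returns (start, end, chunk) tuples, where text[start:end] == chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append((start, end, text[start:end]))
        if end == len(text):  # last chunk reached the end of the text
            break
        start += step
    return chunks

spans = [(start, end) for start, end, _ in chunk_text("x" * 260, chunk_size=100, overlap=20)]
print(spans)  # [(0, 100), (80, 180), (160, 260)]
```

The window advances by `chunk_size - overlap` characters, so the last 20 characters of each chunk reappear at the start of the next one.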

Quick check

Why is overlap between chunks important in RAG?

Embeddings & Vector Search

Embeddings convert text into vectors where semantic similarity becomes geometric closeness. Vector similarity search finds the most relevant chunks.
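
To make "geometric closeness" concrete, here is a minimal sketch of cosine similarity and a brute-force top-k search over dense vectors. The three-dimensional vectors and chunk names are invented for illustration; real embeddings come from a model and have hundreds of dimensions, and large collections typically use an approximate nearest-neighbor index instead of a full scan.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 3) -> list[tuple[float, str]]:
    """Score every chunk against the query and return the k best (score, name) pairs."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    return sorted(scored, reverse=True)[:k]

# Made-up vectors, for illustration only.
query = [0.9, 0.1, 0.0]
chunk_vectors = {
    "semantic search": [0.8, 0.2, 0.1],
    "embeddings": [0.7, 0.3, 0.2],
    "cooking": [0.0, 0.1, 0.9],
}
results = top_k(query, chunk_vectors, k=2)
for score, name in results:
    print(f"{name}: {score:.2f}")
```

Because the score depends only on the angle between vectors, scaling a vector by a positive constant leaves its similarity to the query unchanged.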

Vector Space Visualization

The visualization plots the query, the retrieved chunks, and the other chunks in embedding space, showing how similar chunks cluster near the query vector.

Top 3 most relevant chunks

  • #1 (67%): Semantic search uses embeddings to find conceptually similar documents.
  • #2 (59%): Vector embeddings transform text into numerical representations for similarity search.
  • #3 (59%): Large language models are trained on massive amounts of text data.

Dense vs. Sparse Retrieval

Dense (semantic)

  • Uses embeddings: 384-1536 dimensions
  • Captures meaning & concepts
  • Works across paraphrases
  • Slower but more semantic

Sparse (keyword)

  • Uses TF-IDF or BM25
  • Exact term matches
  • Good for domain-specific terms
  • Faster but less robust
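
As a rough illustration of sparse retrieval, the sketch below scores documents with BM25. It assumes a simple lowercase alphanumeric tokenizer, the commonly used defaults `k1=1.5` and `b=0.75`, and a non-negative IDF variant; the sample documents are made up. Production systems normally rely on a search engine or library rather than hand-rolled scoring.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # only exact term matches contribute
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical documents, for illustration only.
docs = [
    "Error code E1234 means disk failure",
    "The disk is a storage device",
    "Cats sleep a lot",
]
keyword_scores = bm25_scores("E1234", docs)
print(keyword_scores)
```

A rare identifier such as `E1234` matches only the first document, which is the kind of exact-term query where keyword scoring tends to do well.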

Hybrid search

Combines dense and sparse retrieval using Reciprocal Rank Fusion (RRF):

score = 1/(rank_dense + k) + 1/(rank_sparse + k)
(k=60 typical)

Hybrid search gives the best of both: semantic understanding plus keyword precision.
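
The RRF formula above translates almost directly into code. In this sketch, ranks are 1-based and a document missing from one ranking simply contributes nothing for that ranking; that handling, the function name, and the sample document IDs are assumptions made for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one list sorted by RRF score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # rank 1 = best
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense_ranking = ["a", "b", "c"]   # hypothetical dense (semantic) results
sparse_ranking = ["c", "a", "d"]  # hypothetical sparse (keyword) results

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print([doc_id for doc_id, _ in fused])  # ['a', 'c', 'b', 'd']
```

Documents that appear near the top of both lists ("a" and "c") outrank those found by only one method, without needing the two methods' raw scores to be on the same scale.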


Quick check

Cosine similarity in embedding space measures which property?

The 6 Steps Explained

Deep dive into each stage of the RAG pipeline with key details.

Key Takeaway

RAG grounds LLMs in current, external knowledge. It addresses three problems at query time:

  • LLMs are trained on static data from the past
  • Their knowledge decays over time (the model becomes outdated)
  • Models cannot know about proprietary company data

Retrieval quality needs measurable metrics. Precision@k is the fraction of the top-k results that are relevant. Recall@k is the fraction of all relevant documents that appear in the top-k. NDCG (Normalized Discounted Cumulative Gain) rewards placing relevant items near the top of the ranking. MRR (Mean Reciprocal Rank) averages, over a set of queries, the reciprocal of the rank at which the first relevant result appears, so a first hit at rank 1 scores 1 and a first hit at rank 4 scores 0.25.
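
These metrics are short enough to compute by hand. The sketch below assumes binary relevance (a document is either relevant or not) and computes MRR as the mean of 1/rank of the first relevant result over a set of queries; the function names and the document IDs are illustrative.

```python
import math

def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is found."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked results, relevant set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

ranked = ["d3", "d1", "d7", "d2"]  # hypothetical retrieval output
relevant = {"d1", "d2"}            # hypothetical ground truth

print(precision_at_k(ranked, relevant, 3))  # 1 of the top 3 is relevant
print(recall_at_k(ranked, relevant, 3))     # 0.5
print(reciprocal_rank(ranked, relevant))    # 0.5 (first hit at rank 2)
print(ndcg_at_k(ranked, relevant, 3))
```

Graded relevance (for example, scores from 0 to 3) would change only the gain term in `ndcg_at_k`; the other metrics stay as written.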

RAG Playground

Try different queries and retrieval methods. See how dense, sparse, and hybrid approaches rank the same chunks differently.

Retrieval method: Hybrid (Both)

Combines dense and sparse methods via reciprocal rank fusion for best coverage.

Retrieved chunks (3)

  • 1 (Relevance: 3%): Deep learning uses neural networks with multiple layers to process complex patterns and representati…
  • 2 (Relevance: 3%): Neural networks are inspired by biological neurons and use interconnected layers to transform inputs…
  • 3 (Relevance: 3%): …lassification and regression tasks.


Quick check

When would sparse (TF-IDF) retrieval outperform dense (semantic) retrieval?


Quick check

In RAG, what happens if none of the top-k retrieved chunks contain the answer?

Simple RAG retrieves once per query. Advanced systems iterate: retrieve, generate intermediate reasoning, retrieve again with new queries, then generate the final answer. This iterative approach mirrors how humans research: find initial information, form questions, search for more specific information, then synthesize. Approaches such as Self-RAG use the LLM itself to decide when more retrieval is needed, creating a feedback loop between generation and retrieval.

Moving from research to production introduces challenges. Vector databases (Pinecone, Weaviate, Milvus) handle millions of embeddings. Latency matters, because every retrieval adds time. There is a cold-start problem: new documents cannot be retrieved until they have been chunked and embedded. Index staleness raises the question of when to re-index. Monitoring is critical: track retrieval precision, generation factuality, and user satisfaction. A/B testing retrieval strategies (dense vs. sparse, different chunk sizes) before deploying is essential.
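
The iterative retrieve-then-generate loop described earlier can be sketched as a small control loop. Everything here is hypothetical scaffolding: `retrieve`, `next_query`, and `generate` are placeholder callables that a real system would back with a vector search and LLM calls, and the `fake_*` stubs exist only so the sketch runs. It is not the Self-RAG method itself, just the general shape of multi-round retrieval.

```python
from typing import Callable, Optional

def iterative_rag(
    question: str,
    retrieve: Callable[[str], list[str]],
    next_query: Callable[[str, list[str]], Optional[str]],
    generate: Callable[[str, list[str]], str],
    max_rounds: int = 3,
) -> str:
    """Retrieve, ask whether a follow-up query is needed, repeat, then answer."""
    context: list[str] = []
    query: Optional[str] = question
    for _ in range(max_rounds):  # cap the rounds to bound latency and cost
        if query is None:        # the model decided it has enough context
            break
        for chunk in retrieve(query):
            if chunk not in context:  # avoid duplicate context
                context.append(chunk)
        query = next_query(question, context)
    return generate(question, context)

# Stubs standing in for a vector search and an LLM (illustration only).
FAKE_KB = {
    "Who founded Acme?": ["Acme was founded by Jane Doe."],
    "Jane Doe background": ["Jane Doe studied physics."],
}

def fake_retrieve(query: str) -> list[str]:
    return FAKE_KB.get(query, [])

def fake_next_query(question: str, context: list[str]) -> Optional[str]:
    return "Jane Doe background" if len(context) == 1 else None

def fake_generate(question: str, context: list[str]) -> str:
    return " ".join(context)

answer = iterative_rag("Who founded Acme?", fake_retrieve, fake_next_query, fake_generate)
print(answer)  # Acme was founded by Jane Doe. Jane Doe studied physics.
```

The `max_rounds` cap reflects the latency concern above: each extra round adds at least one retrieval and one model call.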

Key Takeaways

  • RAG augments LLMs with external knowledge at inference time, addressing knowledge cutoff, hallucination, and domain expertise problems.
  • Chunking is foundational. The right chunk size (256-512 tokens) and overlap (20%) balance context preservation with retrieval precision.
  • Embeddings capture semantic meaning. Dense retrieval finds conceptually similar chunks; sparse retrieval finds keyword matches. Hybrid combines both.
  • Quality depends first on retrieval. If the retrieved chunks don't contain the answer, the LLM has nothing to ground it in: garbage in, garbage out.
  • Hybrid retrieval (RRF) is a safe default. It balances semantic and keyword precision, adapting to many query types.
