RAG — Retrieval Augmented Generation
Learn how to augment language models with external knowledge. RAG gives LLMs access to current information, domain expertise, and proprietary data, so their answers are no longer limited to what was in the training set. This lesson covers chunking, embeddings, vector search, and the complete pipeline.
Why RAG?
Large language models are powerful, but they have three fundamental limitations. First, their training data has a cutoff date—they know nothing about recent events. Second, they have no access to proprietary information (your company's documents, product specs, etc.). Third, they can hallucinate or confabulate facts when they're uncertain.
Retrieval Augmented Generation addresses all three by letting the model retrieve relevant documents from a knowledge base at inference time. Instead of relying only on weights learned during training, the model can draw on external sources supplied in its prompt. This makes responses more factual, more up-to-date, and easier to ground in evidence.
When to use RAG
- ✓ Answer questions about live/changing data
- ✓ Incorporate company-specific knowledge
- ✓ Reduce hallucination and improve accuracy
- ✓ Audit the model's source information
Quick check
What is the primary advantage of RAG over standard LLMs?
The RAG Pipeline
RAG is a six-step process: query, embed, search, retrieve, augment, and generate. All six steps run at inference time; the documents themselves are chunked and embedded beforehand.
End-to-End RAG Pipeline
- Query: User asks a question
- Embed: Convert query to vectors
- Search: Find similar chunks
- Retrieve: Get top-k documents
- Augment: Add context to prompt
- Generate: LLM produces answer
Example Query
"What are the key benefits of RAG systems?"
Key stages
- Embed: Convert query to vector using same embedding model as docs
- Search: Find nearest neighbors in vector space (similarity search)
- Retrieve: Return top-k chunks from knowledge base
- Augment: Insert chunks into prompt as context
- Generate: LLM generates answer based on context + query
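The stages above can be sketched end to end in a few lines of Python. This is a minimal illustration, not a production design: the `CHUNKS` list is a made-up knowledge base, `embed` is a toy bag-of-words stand-in for a real embedding model, and `generate` is a placeholder where an actual LLM call would go.

```python
import math
import re
from collections import Counter

# Hypothetical toy knowledge base; a real system would store pre-embedded chunks.
CHUNKS = [
    "RAG retrieves relevant chunks from a knowledge base at inference time.",
    "Chunk overlap keeps information from being split across boundaries.",
    "BM25 is a sparse keyword retrieval method.",
]

def embed(text):
    """Toy embedding (assumption): word counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    q_vec = embed(query)                                      # Embed
    scored = [(cosine(q_vec, embed(c)), c) for c in chunks]   # Search
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]                         # Retrieve top-k

def augment(query, context_chunks):
    """Insert the retrieved chunks into the prompt as context."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

def generate(prompt):
    """Placeholder for an LLM call; it only reports the prompt size."""
    return f"[LLM response to a {len(prompt)}-character prompt]"

query = "What does RAG retrieve at inference time?"
top_chunks = retrieve(query, CHUNKS, k=2)
prompt = augment(query, top_chunks)
answer = generate(prompt)
```

Note that the query and the chunks pass through the same `embed` function, which mirrors the requirement that the query use the same embedding model as the documents.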
Trade-offs
RAG quality depends on retrieval quality. If the top-k chunks don't contain the answer, the LLM has nothing to ground its response in: it must fall back on whatever it learned in training, or guess.
Better retrieval = better answers. This is why chunking strategy and embedding model matter.
Quick check
Why does RAG quality depend primarily on retrieval, not the LLM?
Document Chunking
Before retrieval, documents must be split into chunks. The right chunk size balances context preservation with retrieval precision.
Document Chunking Strategy
Chunk size is the number of characters per chunk; overlap is the number of characters shared between consecutive chunks.
Example settings: 15 total chunks, about 12 words per chunk on average, 25% overlap.
Sample text:
Retrieval augmented generation is a technique that combines the strengths of large language models with external knowledge sources. Instead of relying solely on the knowledge encoded in model weights during training, RAG systems retrieve relevant documents or passages from a knowledge base and use them to augment the LLM's input prompt. This allows models to provide more accurate, up-to-date, and fact-grounded responses. The RAG pipeline consists of three main stages: first, the user query is embedded into a vector representation. Second, similar documents are retrieved from the knowledge base using vector similarity search. Third, the retrieved documents are concatenated with the original query and fed to the language model, which generates the final response based on this enriched context. This approach has proven effective for question answering, fact verification, and domain-specific…
Chunk size guide
- Too small (50 tokens): Loses context, fragmented meaning, high overlap needed
- Just right (256 to 512 tokens): Fits context window, captures complete ideas, good relevance
- Too large (1000+ tokens): Adds irrelevant context, may exceed context limits, noisier
Overlap strategy
Overlap between chunks (typically 20% of chunk size) prevents important information from being split across chunk boundaries.
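A sliding window is one simple way to implement overlapping chunks. The sketch below splits by characters to keep it dependency-free; a real pipeline would more likely count tokens and respect sentence boundaries. The function name `chunk_text` and its defaults are illustrative assumptions.

```python
def chunk_text(text, chunk_size=80, overlap=20):
    """Split text into character chunks; neighbours share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks

# Synthetic 200-character input so the boundaries are easy to inspect.
sample = "".join(str(i % 10) for i in range(200))
chunks = chunk_text(sample, chunk_size=80, overlap=20)

# The last 20 characters of one chunk reappear at the start of the next.
shared = chunks[0][-20:] == chunks[1][:20]
```

With a chunk size of 80 and an overlap of 20, the window advances 60 characters at a time, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.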
Quick check
Why is overlap between chunks important in RAG?
Embeddings & Vector Search
Embeddings convert text into vectors where semantic similarity becomes geometric closeness. Vector similarity search finds the most relevant chunks.
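To make "geometric closeness" concrete, here is a small sketch of cosine similarity, which compares the direction of two vectors and ignores their length. The three-dimensional vectors and their labels are invented for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" for a query and three chunks.
query_vec = [1.0, 0.0, 1.0]
chunk_vecs = {
    "similar topic": [2.0, 0.0, 2.0],   # same direction, different length
    "partly related": [1.0, 1.0, 0.0],
    "unrelated": [0.0, 1.0, 0.0],       # orthogonal to the query
}

# Rank chunk labels from most to least similar to the query.
ranked = sorted(
    chunk_vecs,
    key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
    reverse=True,
)
```

The "similar topic" vector is twice as long as the query vector yet scores highest, because only the angle between the vectors matters.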
Vector Space Visualization
Embeddings of similar chunks cluster near the query vector.
Top 3 most relevant chunks
- Semantic search uses embeddings to find conceptually similar documents.
- Vector embeddings transform text into numerical representations for similarity search.
- Large language models are trained on massive amounts of text data.
Dense vs. Sparse Retrieval
Dense (semantic)
- Uses embeddings, typically 384 to 1536 dimensions
- Captures meaning and concepts
- Works across paraphrases
- Slower, but better at matching meaning
Sparse (keyword)
- Uses TF-IDF or BM25
- Matches exact terms
- Good for domain-specific terms
- Faster, but less robust to rewording
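The sparse side can be illustrated with a compact BM25 scorer. This is a sketch under stated assumptions: the tokenizer is a simple regex, the IDF term uses one common smoothed variant, `k1=1.5` and `b=0.75` are conventional defaults rather than tuned values, and the documents and the error code `ERR-4012` are invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric terms, keeping hyphens (e.g. codes)."""
    return re.findall(r"[a-z0-9-]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            tf = counts[term]
            if tf == 0:
                continue  # term absent: contributes nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "Error code ERR-4012 means the license key has expired.",
    "Restart the service when the application fails to respond.",
    "License keys can be renewed from the account page.",
]
scores = bm25_scores("ERR-4012", docs)
```

Only the first document contains the exact term, so it is the only one with a non-zero score. Exact matching like this is why sparse retrieval suits identifiers and other domain-specific terms that an embedding model may not represent well.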
Hybrid search
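Combines dense and sparse retrieval using Reciprocal Rank Fusion (RRF): each chunk's fused score is the sum, over the retrievers, of 1 / (k + rank), where rank is the chunk's position in that retriever's results and k is a smoothing constant (commonly 60).
Hybrid search gives the best of both: semantic understanding plus keyword precision.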
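A minimal sketch of the fusion step, assuming each retriever returns an ordered list of chunk IDs with the best match first. The IDs `c1` to `c4` and the two rankings are made up, and `k=60` is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every ID it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical outputs of a dense and a sparse retriever for the same query.
dense_ranking = ["c2", "c1", "c3"]
sparse_ranking = ["c1", "c4", "c2"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
# fused == ["c1", "c2", "c4", "c3"]
```

Because RRF uses only rank positions, it does not need the dense and sparse scores to be on comparable scales. Chunks that both retrievers rank highly (`c1`, `c2`) rise above chunks found by only one of them.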
Quick check
Cosine similarity in embedding space measures which property?
The 6 Steps Explained
Deep dive into each stage of the RAG pipeline with key details.
Key Takeaway
RAG grounds LLMs in current knowledge.
- LLMs are trained on static data from the past
- That knowledge decays over time as the model becomes outdated
- Models cannot know about proprietary company data
- RAG addresses all three problems at query time
RAG Playground
Try different queries and retrieval methods. See how dense, sparse, and hybrid approaches rank the same chunks differently.
Hybrid (Both): combines dense and sparse methods via reciprocal rank fusion for best coverage.
Retrieved chunks (3)
- Deep learning uses neural networks with multiple layers to process complex patterns and representati…
- Neural networks are inspired by biological neurons and use interconnected layers to transform inputs…
- …lassification and regression tasks.
Quick check
When would sparse (TF-IDF) retrieval outperform dense (semantic) retrieval?
Quick check
In RAG, what happens if none of the top-k retrieved chunks contain the answer?
Key Takeaways
- ✓ RAG augments LLMs with external knowledge at inference time, addressing knowledge cutoff, hallucination, and missing domain expertise.
- ✓ Chunking is foundational. The right chunk size (256 to 512 tokens) and overlap (about 20%) balance context preservation with retrieval precision.
- ✓ Embeddings capture semantic meaning. Dense retrieval finds conceptually similar chunks; sparse retrieval finds keyword matches. Hybrid combines both.
- ✓ Quality depends primarily on retrieval, not the LLM. If the retrieved chunks don't contain the answer, the LLM has nothing to ground its response in: garbage in, garbage out.
- ✓ Hybrid retrieval (RRF) is often the safest default. It balances semantic understanding with keyword precision across many query types.