LoRA & Fine-tuning — Efficient model adaptation
Fully fine-tuning a 7-billion-parameter model in FP32 with Adam takes about 112 GB of VRAM for weights, gradients, and optimizer states alone. LoRA can cut the number of trainable parameters by 99% by decomposing each weight update into two small matrices. Learn why low-rank decomposition works, how to pick the right rank, and the trade-offs between LoRA, prefix tuning, and adapters.
Why fine-tune at all?
Pre-trained models like GPT-2, BERT, or LLaMA are trained on vast amounts of internet text. They learn grammar, facts, and general reasoning. But they're not specialized for your task.
Fine-tuning updates the model's weights to specialize it for your domain: medical text, legal documents, code, customer support, and so on. A few epochs of training usually suffice, because the pre-trained weights already encode the basics and you are only teaching task-specific patterns.
Pre-trained + task-specific
Pre-training gives foundational knowledge. Fine-tuning adds task-specific expertise with minimal data and time.
Examples
Clinical reasoning
Fine-tune on medical papers and case studies to improve clinical reasoning and safety.
Code generation
Specialize for your language, codebase style, or internal APIs with domain-specific examples.
Customer support
Adapt the model to your product, policies, and tone for consistent, on-brand responses.
Legal document analysis
Fine-tune to recognize contract types, clauses, and risks specific to your jurisdiction.
Quick check
Why do we fine-tune pre-trained models instead of training from scratch?
The full fine-tuning problem
Memory nightmare
A 7B parameter model uses:
- 28 GB: Model weights (7B × 4 bytes FP32)
- 28 GB: Gradients during backprop
- 56 GB: Optimizer states (Adam momentum + variance, 2 × 28 GB)
- 112 GB minimum VRAM in total, before activations
Other costs
- Time: Slow backprop through 7B parameters
- Storage: 28-30 GB checkpoint per task (copy for each fine-tune job)
- Catastrophic forgetting: Updating every weight can overwrite pre-trained knowledge
- Task interference: Hard to do multi-task learning without conflicts
Quick check
What is the biggest challenge when fully fine-tuning a 7B model?
During backpropagation, PyTorch stores:
- 1. Model weights (FP32): 7B × 4 bytes = 28 GB
- 2. Gradients (FP32): 7B × 4 bytes = 28 GB
- 3. Optimizer states (Adam uses 2 states per param): 7B × 2 × 4 bytes = 56 GB
Total: 28 + 28 + 56 = 112 GB before counting activations. Activation checkpointing only reduces activation memory, so these 112 GB remain and a single 80 GB GPU is not enough. That means at least 2 × H100 (80 GB each) with the training state split across them, costing roughly $12,000+ per month on cloud.
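The arithmetic above can be checked with a small helper. This is a minimal sketch, assuming FP32 values, Adam with two states per parameter, and 1 GB = 10^9 bytes; the function name `full_finetune_memory_gb` is illustrative, and activations are deliberately left out.

```python
def full_finetune_memory_gb(n_params, bytes_per_value=4, optimizer_states_per_param=2):
    """Static training memory in GB (1 GB = 1e9 bytes), excluding activations."""
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value
    optimizer = n_params * optimizer_states_per_param * bytes_per_value
    return {
        "weights": weights / 1e9,
        "gradients": gradients / 1e9,
        "optimizer": optimizer / 1e9,
        "total": (weights + gradients + optimizer) / 1e9,
    }

mem = full_finetune_memory_gb(7_000_000_000)
for name, gb in mem.items():
    print(f"{name}: {gb:.0f} GB")
```

Changing `bytes_per_value` to 2 gives a rough picture of a half-precision setup, although real mixed-precision training usually keeps extra FP32 copies that this sketch does not model.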
The low-rank insight
During fine-tuning, the weight updates don't explore the full rank of the weight matrix. Instead, they lie in a low-dimensional subspace.
Think of it like this: A 768×768 weight matrix has 589,824 degrees of freedom. But to adapt to your task, you might only need to move along 4–64 independent directions in that space.
This is supported by empirical findings: a singular value decomposition (SVD) of fine-tuning updates shows that most of the information is captured by the top-k singular values. A rank-8 or rank-16 decomposition is often enough to recover nearly all of the adaptation needed.
The key insight
Weight updates are approximately low-rank. Instead of updating all 589K parameters, train just two small matrices (A and B) whose product approximates the same update.
Low-Rank Matrix Decomposition
A large weight matrix W (16×16) decomposes into two smaller matrices A and B, reducing parameters from 256 to 2 × (16 × 4) = 128.
W (16×16) ≈ A (16×4) · B (4×16). Original params: 256. LoRA params: 128. Reduction: 50%.
Quick check
What does "low-rank" mean in the context of weight updates?
Researchers computed singular value decomposition (SVD) on weight updates during fine-tuning:
- 1. Fine-tune a model on a downstream task
- 2. Compute U, Σ, V from the SVD of each weight matrix delta (W_final - W_pretrained)
- 3. Plot the singular values on a log scale
The result: the singular values decay rapidly, roughly exponentially. The top 4–64 singular values (depending on model size) capture >95% of the Frobenius norm. Observations like this motivated LoRA (Hu et al., 2021).
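The three steps above can be sketched with NumPy. This is a toy illustration, not a reproduction of any published experiment: the "weight delta" is synthetic (a rank-4 signal plus small noise, chosen here as an assumption), and `energy_captured` is an illustrative helper that reports the share of the squared Frobenius norm held by the top-k singular values.

```python
import numpy as np

def energy_captured(delta_w, k):
    """Fraction of the squared Frobenius norm captured by the top-k singular values."""
    s = np.linalg.svd(delta_w, compute_uv=False)  # singular values, descending
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

rng = np.random.default_rng(0)
d, true_rank = 64, 4

# Synthetic stand-in for W_final - W_pretrained: low-rank signal plus small noise.
delta_w = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d))
delta_w += 0.01 * rng.normal(size=(d, d))

for k in (1, 2, 4, 8):
    print(f"top-{k}: {energy_captured(delta_w, k):.4f}")
```

With a real model you would replace `delta_w` with the difference between a fine-tuned and a pre-trained weight matrix and look at how quickly the captured fraction approaches 1.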
LoRA architecture
Instead of learning a full weight update ΔW, we decompose it into two low-rank matrices A and B:
h = (W + (α/r) · B·A) · x
Here:
- W: Original weight matrix (d × d), frozen during training
- A: Down-projection from the input to the rank-r space (r × d), randomly initialized
- B: Up-projection from the rank-r space to the output (d × r), initialized to zero
- r: Rank (commonly 4, 8, 16, 32, or 64), much smaller than d
- α: Scaling factor (often set to 2·r) that controls the magnitude of the update
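The formula can be made concrete with a small NumPy forward pass. This is a sketch under stated assumptions rather than a training-ready layer: it uses the shape convention under which B·A·x is well defined (A is r × d, B is d × r), the 0.01 standard deviation for A is an arbitrary choice, and the class name `LoRALinear` is illustrative.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, r, alpha, seed=0):
        d_out, d_in = W.shape
        rng = np.random.default_rng(seed)
        self.W = W                                 # frozen, d_out x d_in
        self.A = rng.normal(0.0, 0.01, (r, d_in))  # down-projection, random init (assumed std)
        self.B = np.zeros((d_out, r))              # up-projection, zero init
        self.scale = alpha / r

    def forward(self, x):
        # Apply A first, then B, so the full d x d update is never built.
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

d, r = 768, 16
rng = np.random.default_rng(1)
W = rng.normal(size=(d, d))
x = rng.normal(size=d)

layer = LoRALinear(W, r=r, alpha=2 * r)
print(np.allclose(layer.forward(x), W @ x))  # B is zero, so the base output is unchanged
print(layer.A.size + layer.B.size)           # trainable parameters in A and B
```

Computing `B @ (A @ x)` instead of `(B @ A) @ x` keeps the extra work proportional to d·r rather than d·d during training.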
Parameter comparison
- Full: d × d = 768 × 768 = 589,824 params
- LoRA: 2 × d × r = 2 × 768 × 16 = 24,576 params
- Reduction: (589,824 - 24,576) / 589,824 = 95.8%
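The same comparison can be repeated for other ranks. A minimal sketch, assuming a single square d × d matrix as in the numbers above; `lora_reduction` is an illustrative name.

```python
def lora_reduction(d, r):
    """Return (full params, LoRA params, fractional reduction) for one d x d matrix."""
    full = d * d
    lora = 2 * d * r
    return full, lora, 1 - lora / full

for r in (4, 8, 16, 32, 64):
    full, lora, reduction = lora_reduction(768, r)
    print(f"r={r:>2}: {lora:>6,} of {full:,} params ({reduction:.1%} fewer)")
```

The reduction depends only on the ratio 2r/d, so the same rank saves proportionally more on wider models.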
Rank vs. Compression Tradeoff
Higher rank = better approximation but more parameters. Find your sweet spot. (Base model: 768D hidden dim, 12 layers)
At rank 8: 147.5K LoRA params of 7.1e+6 (97.9% fewer to train). Memory saved vs. full fine-tune: 0.0 GB at the displayed precision. Approximate quality: 29% of full capacity.
Practical insight
At rank 8, LoRA trains about 98% fewer parameters, but the estimated approximation quality is only 29%. Consider a higher rank for better task performance.
Quick check
In LoRA, why is matrix B initialized to zero?
The scaling factor α/r is crucial for gradient magnitude matching:
- 1. The update B·A is a sum of r rank-one terms, so without scaling its magnitude (and the effective learning rate) would change with the rank.
- 2. Dividing by r keeps the update magnitude roughly consistent regardless of rank.
- 3. In practice, α = 2·r is a common starting point across ranks (4, 8, 16, 32, 64).
Without this scaling, a rank-4 update would behave very differently from a rank-64 update with the same learning rate.
Comparing adaptation methods
LoRA is one of several parameter-efficient fine-tuning (PEFT) methods. Here's how they stack up:
Tuning Methods Comparison
Different approaches to adapting pre-trained models, ranked by efficiency and practical use. Percentages are the share of parameters trained.
- Full Fine-Tuning: Train all model parameters (100.0%)
- LoRA: Low-rank weight updates (0.5%)
- Prefix Tuning: Learn task-specific prefixes (2.0%)
- Adapters: Insert bottleneck layers (3.0%)
Why LoRA leads
LoRA can cut trainable parameters by more than 99% on large models, makes task-switching easy, trains fast, and can be merged back into the base model so that inference has no added cost. It sits in the sweet spot between full fine-tuning and minimal-parameter approaches.
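Merging is a single matrix addition, which a short NumPy check makes visible. This is a sketch with random stand-in values for a trained adapter; `merge_lora` is an illustrative helper, and the shapes assume A is r × d and B is d × r.

```python
import numpy as np

def merge_lora(W, A, B, alpha):
    """Fold the low-rank update into the base weight: W + (alpha / r) * B @ A."""
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d, r, alpha = 32, 4, 8
W = rng.normal(size=(d, d))
A = rng.normal(size=(r, d))
B = rng.normal(size=(d, r))  # stands in for a trained, nonzero B
x = rng.normal(size=d)

unmerged = W @ x + (alpha / r) * (B @ (A @ x))  # base path plus adapter path
W_merged = merge_lora(W, A, B, alpha)
merged = W_merged @ x                           # single matmul after merging

print(np.allclose(unmerged, merged))
print(W_merged.shape == W.shape)
```

Because the merged matrix has the same shape as W, the deployed model runs exactly one matrix multiply per layer, and subtracting the same term restores the base weights when switching tasks.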
Quick check
Why can LoRA be "merged" into the base model for deployment, while prefix tuning cannot?
LoRA strengths
- ✓ 99% parameter reduction
- ✓ Merges into base model (no added inference cost)
- ✓ Multiple adapters per model
- ✓ Simple to implement
LoRA considerations
- ~ Rank selection requires tuning
- ~ Where to apply LoRA (Q/K/V? FFN?) matters
- ~ Slight approximation error vs. full fine-tune
- ~ Separate adapter files to manage per task
Interactive playground
Pick a model size and rank. See real-time parameter counts, memory usage, and approximation quality.
Full fine-tuning: 7.1e+6 trainable parameters; memory (FP32) 0.0 GB at the displayed precision.
LoRA (rank 16): 294.9K trainable parameters; memory (FP32) 0.0 GB at the displayed precision.
Memory saved: 0.0 GB. 24× fewer parameters to train.
Estimated approximation quality: 58% of full capacity. Good rank selection. Consider increasing for better performance.
Your LoRA configuration
• Model: Base (12 layers, 768D hidden)
• Rank r = 16 (recommended: 4–64 for most tasks)
• Alpha = 32 (scales the update by α/r = 2.00)
• Per-layer LoRA: 24,576 params
• Would save 0.0 GB compared to full fine-tuning at the displayed precision
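The playground numbers for this configuration can be reproduced by hand. This sketch assumes the playground counts one d × d matrix per layer (which matches the 7.1e+6 and 294.9K figures) and uses 1 MB = 10^6 bytes; `playground_readout` is an illustrative name, not part of any library.

```python
def playground_readout(layers, d, r, alpha, bytes_per_param=4):
    """Recompute the playground figures, assuming one d x d matrix per layer."""
    full = layers * d * d
    lora = layers * 2 * d * r
    return {
        "full_params": full,
        "lora_params": lora,
        "fewer_params_factor": full / lora,
        "scale": alpha / r,
        "full_mb": full * bytes_per_param / 1e6,
        "lora_mb": lora * bytes_per_param / 1e6,
    }

out = playground_readout(layers=12, d=768, r=16, alpha=32)
for key, value in out.items():
    print(f"{key}: {value:,.2f}")
```

At this model size both memory figures are tens of megabytes or less, which is why a readout rounded to one decimal place in GB shows 0.0.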
Quick check
For the base model (768D, 12 layers), rank-16 LoRA reduces trainable parameters by:
There is no one-size-fits-all rank, but here's a practical strategy:
- Start small: Begin with rank 4–8. Monitor validation loss.
- Plateau detection: If loss plateaus early, increase the rank. If it keeps improving, the current rank is fine.
- Domain: A large shift from the pre-training data (e.g., general English → medical text) needs a higher rank. Similar domains can use a lower rank.
- Layers: Apply LoRA to attention (Q/K/V) first, where the impact is highest. Then add FFN layers if needed.
- Empirical rule: As a rule of thumb, rank 64 recovers ~95% of full fine-tuning performance on most tasks. Verify this on your own task.
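One way to turn the strategy above into a procedure is a simple upward sweep over ranks. This is a sketch, not a prescribed recipe: `evaluate` is a hypothetical callback that trains an adapter at a given rank and returns its validation loss, the `min_gain` threshold is an arbitrary assumption, and the losses in the usage example are made up for illustration.

```python
def pick_rank(evaluate, ranks=(4, 8, 16, 32, 64), min_gain=0.01):
    """Grow the rank until the validation-loss improvement drops below min_gain.

    `evaluate(rank)` is a hypothetical callback that trains a LoRA adapter
    at that rank and returns its validation loss (lower is better).
    """
    best_rank = ranks[0]
    best_loss = evaluate(best_rank)
    for rank in ranks[1:]:
        loss = evaluate(rank)
        if best_loss - loss < min_gain:  # extra capacity no longer pays off
            break
        best_rank, best_loss = rank, loss
    return best_rank, best_loss

# Made-up validation losses, for illustration only.
fake_losses = {4: 1.20, 8: 1.05, 16: 0.98, 32: 0.975, 64: 0.974}
print(pick_rank(fake_losses.get))
```

In practice each call to `evaluate` is a full training run, so starting at a small rank keeps the sweep cheap.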
Key takeaways
Fine-tuning is essential, full fine-tuning is expensive
Pre-trained models need task-specific adaptation. But updating all 7B parameters takes 112 GB of VRAM before activations, plus slow training and a full-size checkpoint per task.
Weight updates are low-rank
Singular value decomposition shows that fine-tuning updates live in low-dimensional subspaces. Rank 4–64 captures more than 95% of the adaptation.
LoRA achieves 99% parameter reduction
Two small matrices (A and B, 2·d·r parameters together) parametrize the update to one large matrix (d×d), cutting trainable parameters from 589K to 24K per layer at d = 768 and r = 16.
LoRA merges into base model
Precompute (α/r)·B·A offline and add it to W once. Deploy a single model with no added inference overhead, or keep the base model and store small adapter files (megabytes, depending on rank and target layers) instead of 30 GB checkpoints.
Rank selection matters
Start with rank 4–8 and increase it if validation loss plateaus. As a rule of thumb, rank 64 recovers ~95% of full fine-tuning performance. Domain similarity affects the rank needed.
LoRA is the industry standard for parameter-efficient fine-tuning.
It balances extreme efficiency (99% reduction), strong performance (95%+ of full FT), easy deployment (merging), and simplicity. It is supported by Hugging Face's PEFT library and Azure ML, and is used in many production systems.
Final check
In production, you fine-tune a 7B model on customer support. After training, what do you deploy?