Quantization: Compress models up to 8x with minimal accuracy loss
Learn why modern AI models are quantized, how 32-bit floats become 4-bit integers, and why your phone can run LLMs that used to require $10,000 GPUs. With live interactive visualizations and a hands-on playground.
Why quantization matters
Modern language models like GPT-4, Claude, and LLaMA have billions of parameters. Each parameter is typically stored as a 32-bit floating-point number, consuming massive amounts of memory and making inference slow.
The challenge: A 7-billion parameter model needs ~26 GB of VRAM just to store weights. That's beyond what most consumer GPUs can handle. Production servers must spread the model across multiple GPUs, adding latency and cost.
The solution: Quantization reduces each 32-bit float to 8 bits (or even 4 bits). The same 7B model shrinks to about 6.5 GB at INT8, or about 3.3 GB at INT4, small enough for consumer hardware, often while losing less than 1% accuracy.
Why it matters in practice
Quantization made it possible to run state-of-the-art models on phones, edge devices, and consumer GPUs. It's the difference between a $50k inference cluster and a MacBook.
FP32 (unquantized): 26 GB
INT8 quantized: 6.5 GB
INT4 quantized: 3.25 GB
Memory savings for LLaMA-7B:
- FP32 → INT8: 4x smaller
- FP32 → INT4: 8x smaller
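The arithmetic behind these sizes is simply parameters times bits per parameter. The sketch below is a rough estimate only: it counts weight storage in GiB (2^30 bytes) and ignores scales, zero-points, activations, and runtime overhead. The function name `weight_memory_gib` is illustrative, not part of any library.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate weight storage in GiB (2**30 bytes).

    Ignores quantization metadata (scales, zero-points) and activations.
    """
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 7_000_000_000  # a 7B-parameter model

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    size = weight_memory_gib(NUM_PARAMS, bits)
    ratio = 32 / bits  # compression relative to FP32
    print(f"{name}: {size:.2f} GiB ({ratio:.0f}x smaller than FP32)")
```

Halving the bit width halves the storage, so FP32 to INT8 is a 4x reduction and FP32 to INT4 is an 8x reduction.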
Quick check
A model quantized from FP32 to INT8 uses how much memory compared to the original?
You might think throwing away 24 of 32 bits would destroy a model. It doesn't, for three reasons:
- Redundancy: Neural network weights are highly redundant. Many nearby parameter values can be rounded to the same value without changing the model's behavior much.
- Insensitive ranges: Most weights cluster near the center of the distribution, and rare values outside the calibrated quantization range usually matter much less. Clipping them lets the quantizer spend its precision where most weights lie.
- Activation awareness: Advanced methods like AWQ use activation statistics to estimate how much each weight affects the model's output, then protect the most important weights from quantization error.
The mechanics: From float to integer
Step 1: Pick a range
Examine all model weights and find their min and max values. For example, weights might range from -2.5 to 3.1.
min = -2.5, max = 3.1, range = 5.6
Step 2: Calculate scale and zero-point
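Map the float range to the integer range. For INT8, that's [0, 255] (unsigned) or [-128, 127] (signed). With integer limits q_min and q_max: scale = (max - min) / (q_max - q_min) and zero_point = round(q_min - min / scale).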
Step 3: Quantize each value
Map each float to an integer using the formula: q = round(x / scale + zero_point), then clamp q to the integer range.
Step 4: Inference
During inference, dequantize INT8 values back to floats: x ≈ (q - zero_point) * scale
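The four steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a production quantizer: it assumes the unsigned [0, 255] range, and the zero-point formula `round(q_min - x_min / scale)` is the common affine convention rather than something the steps above spell out. The function names are made up for this example.

```python
import numpy as np


def quant_params(x_min, x_max, q_min=0, q_max=255):
    """Step 2: derive scale and zero-point from the float range."""
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = int(round(q_min - x_min / scale))
    return scale, zero_point


def quantize(x, scale, zero_point, q_min=0, q_max=255):
    """Step 3: q = round(x / scale + zero_point), clamped to the integer range.

    The uint8 cast assumes the default unsigned range.
    """
    q = np.round(np.asarray(x) / scale + zero_point)
    return np.clip(q, q_min, q_max).astype(np.uint8)


def dequantize(q, scale, zero_point):
    """Step 4: x is approximately (q - zero_point) * scale."""
    return (q.astype(np.float32) - zero_point) * scale


# Step 1: the example range from the text.
scale, zero_point = quant_params(-2.5, 3.1)

weights = np.array([-2.5, -1.0, 0.0, 0.7, 3.1], dtype=np.float32)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)

print("scale:", scale, "zero_point:", zero_point)
print("quantized:", q)
print("max abs error:", np.max(np.abs(restored - weights)))
```

For values inside the range, the round trip is off by at most half a quantization step (scale / 2).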
Quick check
What does the 'zero-point' represent in quantization?
See it live: Number line quantization
Pick a float value and watch it snap to the nearest integer quantization level. See the error (the distance it moves) in real time.
Linear Quantization Visualization
How it works:
Each FP32 value is mapped to an Int8 value using a scale factor and zero point. The dequantized value is the closest approximation, and the error is what we sacrifice for smaller model size.
Quick check
If the quantization scale is 0.05 and zero-point is 100, what integer does 5.1 quantize to?
Weight distribution: Before and after
Compare how weights are distributed in the original model vs. after quantization. Lower bit widths cause "clumping" as multiple float values map to the same integer.
Weight Distribution Before & After Quantization
Original (FP32)
After Quantization (Int8)
MAE: 0.0066
RMSE: 0.0077
Within 5% error: 80.0%
| Metric | Original | Dequantized | Error |
|---|---|---|---|
| Min | -3.303 | -3.304 | 0.001 |
| Mean | -0.041 | -0.041 | 0.000 |
| Max | 3.437 | 3.436 | 0.001 |
Understanding the histograms:
The blue histogram (left) shows how weights are distributed in the original model. The green histogram (right) shows the distribution after quantization and dequantization. Lower bit widths cause more "clumping" as many values map to the same quantized level.
Quick check
In an INT4 quantized model, how many unique values can a parameter have?
With INT8, you have 256 possible values. With INT4, you only have 16. Imagine fitting a bell curve (your weight distribution) into 16 bins instead of 256. Many weights that were slightly different now map to the same integer. This is the accuracy loss, but it's often acceptable because the original differences were small and influenced the model's output very little.
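To see the clumping effect in numbers, the sketch below quantizes synthetic bell-curve weights at 8 and 4 bits and counts how many distinct levels survive. The data is randomly generated for illustration (not taken from a real model), and `fake_quantize` is a hypothetical helper that uses simple min-max linear quantization.

```python
import numpy as np


def fake_quantize(x, bits):
    """Quantize to 2**bits evenly spaced levels over [x.min(), x.max()],
    then map back to floats. Returns (restored values, integer levels)."""
    levels = 2**bits
    scale = (x.max() - x.min()) / (levels - 1)
    q = np.round((x - x.min()) / scale)
    return q * scale + x.min(), q


rng = np.random.default_rng(0)
weights = rng.normal(0.0, 1.0, size=10_000)  # synthetic bell-curve weights

results = {}
for bits in (8, 4):
    restored, q = fake_quantize(weights, bits)
    unique_levels = len(np.unique(q))
    mae = float(np.abs(restored - weights).mean())
    results[bits] = (unique_levels, mae)
    print(f"{bits}-bit: {unique_levels} distinct levels used, MAE = {mae:.4f}")
```

At 4 bits, all 10,000 weights collapse onto at most 16 values, and the mean absolute error is correspondingly larger than at 8 bits.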
Real impact: Model size and speed
See how quantization translates to practical memory and speed gains. Compare across different model sizes and precisions.
Model Size & Memory Comparison
FP32 → INT8 savings: 19.56 GB (75% reduction, 2-4x faster)
FP32 → INT4 savings: 22.82 GB (88% reduction, 4-8x faster)
| Precision | Bit Width | Size (GB) | Compression vs FP32 | Use Case |
|---|---|---|---|---|
| FP32 | 32 | 26.08 | 1x (baseline) | Training, Research |
| FP16 | 16 | 13.04 | 2x smaller | GPU inference, Mixed precision |
| INT8 | 8 | 6.52 | 4x smaller | Mobile, Edge devices |
| INT4 | 4 | 3.26 | 8x smaller | Extreme compression |
Real-world impact:
- INT8 frees enough space to fit 4x as many models on the same device
- INT4 lets a 7B model fit on consumer GPUs without running out of memory
- Smaller models = faster inference = lower latency for end users
Quick check
Which precision is typically used for production inference on consumer GPUs?
Memory savings alone would make quantization valuable. But hardware accelerators also run INT8 and INT4 math faster than FP32 math. Why?
- Smaller operations mean fewer transistors, lower power, less heat.
- Integer arithmetic is simpler than floating-point arithmetic.
- Many GPUs have specialized INT8 units (tensor cores) that can do 4-8x more INT8 operations per clock than FP32 operations.
Result: INT8 models run 2-4x faster in practice. INT4 can reach 4-8x speedup, though with more accuracy loss.
Beyond linear quantization: GPTQ & AWQ
Simple linear quantization works well, but advanced methods squeeze out extra accuracy, especially at 4-bit widths.
GPTQ (Hessian-based)
Uses Hessian information (second-order derivatives) to identify which weights are most important to quantize carefully.
pip install auto-gptq
AWQ (Activation-Aware)
Analyzes activation distributions to identify which weights have the largest impact on the model's output.
pip install autoawq
Quick check
What is the main idea behind AWQ (Activation-Aware Quantization)?
Quantization quality depends heavily on choosing the right quantization range. You don't want to include extreme outliers (which would waste bits) or clip important values (which would lose accuracy).
Calibration means running the model on a representative dataset (usually 50-500 samples) and collecting statistics about weights and activations. Common calibration methods:
- Min-Max: Use the actual min/max. Simple but sensitive to outliers.
- Percentile (1%-99%): Ignore extreme outliers. Better for real data.
- Entropy/KL-Divergence: Find the range that minimizes information loss. Often the best accuracy, at a higher calibration cost.
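The difference between the first two methods is easy to demonstrate. The sketch below compares min-max and percentile ranges on synthetic data with two injected outliers. The data, the helper names, and the use of mean absolute error as the quality metric are all assumptions made for illustration. Entropy-based calibration is omitted because it needs a histogram search that does not fit a short example.

```python
import numpy as np


def minmax_range(x):
    """Min-Max: use the actual extremes, outliers included."""
    return float(x.min()), float(x.max())


def percentile_range(x, lower=1.0, upper=99.0):
    """Percentile: clip the range to the 1st and 99th percentiles."""
    lo, hi = np.percentile(x, [lower, upper])
    return float(lo), float(hi)


def quantization_mae(x, lo, hi, bits=8):
    """Mean absolute error after linear quantization over [lo, hi].
    Values outside the range are clipped to the nearest end."""
    levels = 2**bits - 1
    scale = (hi - lo) / levels
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return float(np.abs(q * scale + lo - x).mean())


rng = np.random.default_rng(0)
# Synthetic calibration statistics: a bell curve plus two extreme outliers.
values = np.concatenate([rng.normal(0.0, 1.0, 10_000), [40.0, -35.0]])

mm_lo, mm_hi = minmax_range(values)
pc_lo, pc_hi = percentile_range(values)

mae_minmax = quantization_mae(values, mm_lo, mm_hi, bits=4)
mae_percentile = quantization_mae(values, pc_lo, pc_hi, bits=4)

print(f"min-max range    [{mm_lo:.2f}, {mm_hi:.2f}] -> MAE {mae_minmax:.4f}")
print(f"percentile range [{pc_lo:.2f}, {pc_hi:.2f}] -> MAE {mae_percentile:.4f}")
```

With min-max, the two outliers stretch the range so far that most of the 16 levels cover empty space. The percentile range clips the outliers but resolves the bulk of the values far more finely.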
Step-by-step walkthrough
Go through all the key concepts in order with our interactive guided lesson.
Guided Quantization Walkthrough
The Core Problem
Modern language models have billions of parameters, each stored as 32-bit floats. A 7B parameter model takes ~26 GB of VRAM just to store weights. This is expensive and slow.
Without Quantization
- LLaMA-7B: 26 GB
- Requires high-end GPU
- Slow inference
- Limited deployment options
With INT4 Quantization
- LLaMA-7B: ~3.25 GB
- Runs on consumer GPU
- 4-8x faster inference
- Feasible for production
The Trade-off
Quantization sacrifices a small amount of model accuracy in exchange for huge gains in speed, memory, and deployment feasibility. In practice, well-quantized models often lose less than 1% accuracy while being 4-8x smaller.
Interactive playground
Experiment with different quantization methods and bit widths. See how accuracy and model size trade off in real time.
Interactive Quantization Playground
Estimated Model Size
Accuracy Across Bit Widths
MAE (mean absolute error): 0.0073
RMSE (root mean square error): 0.0084
Within 5% (values within 5% error): 78.7%
Compression (vs FP32 baseline): 4x
Detailed Analysis
| Parameter | Value |
|---|---|
| Quantization method | linear |
| Bit width | 8 |
| Scale factor | 0.029014 |
| Zero point | 130 |
| Max error | 0.014506 |
| Model size (approx) | 6.52 GB |
Playground tips:
- Drag the bit width slider to see how error changes
- Try different quantization methods to compare accuracy
- Higher bit widths = better accuracy but larger model size
- INT8 is the standard choice for most real-world deployments
Quick check
You have a 70B parameter model and want to run it on a device with 16 GB of VRAM. Which quantization is most likely to fit?
Key takeaways
4-8x compression
INT8 quantization shrinks FP32 models by 75% (4x) and INT4 by about 88% (8x), with minimal accuracy loss.
2-8x speedup
Hardware can execute INT8 and INT4 operations much faster than FP32.
Consumer GPUs
Quantized models fit on affordable hardware, democratizing AI.
Calibration matters
Entropy-based calibration on representative data gives best results.
Advanced methods
GPTQ and AWQ achieve better accuracy at extreme bit widths (2-4 bit).
Trade-off is real
Lower bits = smaller model but more accuracy loss. Choose based on constraints.
Quick check
What is the single most important factor in quantization quality?
1-bit quantization (Binarization): Taking quantization to the extreme. Some research models have been binarized, but practical deployment is limited due to severe accuracy loss.
Mixed-precision: Don't quantize all layers equally. Use FP32 for critical layers (embeddings, output), INT8 or INT4 for the bulk of the model.
Grouped quantization: Instead of one quantization range per layer, use different ranges for different groups of weights. More flexible, more overhead, but better accuracy.
Dynamic quantization: Quantization parameters adapt during inference based on activation statistics. Complex but can achieve state-of-the-art accuracy at low bit widths.
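Grouped quantization, described above, is the easiest of these variants to illustrate. The sketch below gives each group of weights its own range and compares the result with a single shared range. The weight values, the group size of 4, and the function names are made up for this example. Real implementations typically use larger groups and pack the 4-bit values more compactly.

```python
import numpy as np


def quantize_groups(weights, group_size=4, bits=4):
    """Quantize each group of `group_size` weights with its own min-max range.
    Assumes weights.size is divisible by group_size."""
    levels = 2**bits - 1
    groups = weights.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    # Avoid a zero scale when all values in a group are identical.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.round((groups - lo) / scale).astype(np.uint8)
    return q, scale, lo


def dequantize_groups(q, scale, lo, shape):
    """Map integer levels back to floats using each group's scale and offset."""
    return (q * scale + lo).reshape(shape)


# One group of tiny weights next to one group of large weights.
weights = np.array([0.01, -0.02, 0.03, 0.00, 2.0, -1.5, 0.5, 1.0])

# Per-group ranges (two groups of four).
q, scale, lo = quantize_groups(weights, group_size=4)
err_grouped = float(np.abs(dequantize_groups(q, scale, lo, weights.shape) - weights).mean())

# A single shared range (one group of eight).
q1, scale1, lo1 = quantize_groups(weights, group_size=8)
err_single = float(np.abs(dequantize_groups(q1, scale1, lo1, weights.shape) - weights).mean())

print(f"mean abs error, per-group ranges: {err_grouped:.4f}")
print(f"mean abs error, single range:     {err_single:.4f}")
```

The extra overhead mentioned above is visible here: every group stores its own `scale` and `lo` alongside the integer values.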
Want to dive deeper?
Check out these resources to learn more about quantization, model compression, and efficient inference:
- AutoGPTQ: An easy-to-use implementation of GPTQ quantization. Works with most transformers models.
- AutoAWQ: Fast and accurate AWQ quantization with optimized kernels.
- Hugging Face bitsandbytes: Library for 8-bit and 4-bit inference with LoRA fine-tuning.
- Ollama: Run quantized models locally. Great for experimenting with different precisions.