Interactive · ~15 min · Intermediate

Quantization: Compress models up to 8x with minimal accuracy loss

Learn why modern AI models are quantized, how 32-bit floats become 4-bit integers, and why your phone can run LLMs that used to require $10,000 GPUs. With live interactive visualizations and a hands-on playground.

Why quantization matters

Modern language models like GPT-4, Claude, and LLaMA have billions of parameters. Each parameter is typically stored as a 32-bit floating-point number, consuming massive amounts of memory and making inference slow.

The challenge: At 4 bytes per parameter, a 7-billion-parameter model needs ~26 GB of VRAM just to store its weights. That is more than most consumer GPUs can hold, and larger models have to be split across multiple GPUs, which adds latency and cost.

The solution: Quantization reduces each 32-bit float to 8 bits (or even 4 bits). The same 7B model shrinks to about 6.5 GB at 8 bits, or about 3.3 GB at 4 bits, which is small enough for consumer hardware. Well-calibrated quantization typically costs less than 1% accuracy.
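The size figures follow directly from parameters times bits per weight. The sketch below is illustrative only: `weight_memory_gib` is a hypothetical helper name, it reports binary gigabytes (2**30 bytes, which is how 28 billion bytes becomes "~26 GB"), and it ignores activations, the KV cache, and per-group scale metadata.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB (2**30 bytes), weights only."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


# A 7B-parameter model at the precisions discussed in this lesson.
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gib(7e9, bits):.2f} GiB")
```

Halving the bit width halves the footprint, so FP32 to INT8 is a 4x reduction and FP32 to INT4 is an 8x reduction.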

Why it matters in practice

Quantization made it possible to run state-of-the-art models on phones, edge devices, and consumer GPUs. It's the difference between a $50k inference cluster and a MacBook.

FP32 (Unquantized): 26 GB

INT8 Quantized: 6.5 GB

INT4 Quantized: 3.25 GB

Memory savings for LLaMA-7B:

  • FP32 → INT8: 4x smaller
  • FP32 → INT4: 8x smaller

Quick check

A model quantized from FP32 to INT8 uses how much memory compared to the original?

You might think throwing away 24 of 32 bits would destroy a model. It doesn't, for three reasons:

  1. Redundancy: Neural network weights are highly redundant. Many parameters can be rounded to the same value without changing the model's behavior much.
  2. Concentrated ranges: Most weights cluster in a narrow band around zero. Calibration picks a range that covers that band, so the available integer levels are spent where most of the weights are and only rare outliers get clipped.
  3. Activation awareness: Advanced methods like AWQ use activation statistics to find the small fraction of weights that most affect the model's output, then scale them so they lose less precision when quantized.

The mechanics: From float to integer

Step 1: Pick a range

Examine all model weights and find their min and max values. For example, weights might range from -2.5 to 3.1.

weights: [-2.5, -0.3, 0.1, 0.5, 1.2, 3.1, ...]
min = -2.5, max = 3.1, range = 5.6

Step 2: Calculate scale and zero-point

Map the float range onto the integer range. For INT8, that's [0, 255] (unsigned) or [-128, 127] (signed). The numbers below use the unsigned range [0, 255].

scale = (max - min) / 255 = 5.6 / 255 ≈ 0.022
zero_point = round(-min / scale) ≈ 114
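Steps 1 and 2 can be sketched in a few lines of plain Python. This is a minimal illustration under the assumptions used in the text (unsigned 8-bit range, min/max taken directly from the weights); `affine_params` is a hypothetical helper name, not a library function.

```python
def affine_params(w_min: float, w_max: float, num_bits: int = 8):
    """Return (scale, zero_point) mapping [w_min, w_max] onto [0, 2**num_bits - 1]."""
    q_max = 2 ** num_bits - 1           # 255 for 8 bits
    scale = (w_max - w_min) / q_max     # float width of one integer step
    zero_point = round(-w_min / scale)  # integer that represents float 0.0
    return scale, zero_point


weights = [-2.5, -0.3, 0.1, 0.5, 1.2, 3.1]
scale, zero_point = affine_params(min(weights), max(weights))
print(f"scale = {scale:.6f}, zero_point = {zero_point}")
```

The unrounded scale is 5.6 / 255 ≈ 0.02196, which the text rounds to 0.022; the zero-point comes out as 114 either way.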

Step 3: Quantize each value

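Map each float to an integer using the formula q = round(x / scale + zero_point), then clamp q to [0, 255] so that values outside the calibrated range saturate instead of overflowing.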

-2.5 → round(-2.5 / 0.022 + 114) = 0
0.5 → round(0.5 / 0.022 + 114) = 137
3.1 → round(3.1 / 0.022 + 114) = 255
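The three mappings above can be reproduced with a short function. This is a sketch under the same assumptions as the worked example (unsigned 8-bit range, scale 5.6 / 255, zero-point 114); the clamp is included so that values outside the calibrated range saturate at 0 or 255.

```python
def quantize(x: float, scale: float, zero_point: int, num_bits: int = 8) -> int:
    """Affine quantization: q = clamp(round(x / scale + zero_point), 0, 2**num_bits - 1)."""
    q = round(x / scale + zero_point)
    return max(0, min(2 ** num_bits - 1, q))


scale, zero_point = 5.6 / 255, 114  # values from Step 2
for x in (-2.5, 0.5, 3.1):
    print(x, "->", quantize(x, scale, zero_point))
```

A value far outside the range, such as 10.0, is clamped to 255 rather than wrapping around.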

Step 4: Inference

During inference, dequantize INT8 values back to floats: x ≈ (q - zero_point) * scale

137 → (137 - 114) * 0.022 ≈ 0.506 (original: 0.5)
Error: 0.006 (way smaller than the value itself!)
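A round trip through quantize and dequantize shows where the error comes from. The sketch below assumes the same scale and zero-point as the worked example, but keeps the scale unrounded (5.6 / 255), so it recovers about 0.505 rather than the 0.506 obtained with the rounded scale of 0.022. For values inside the calibrated range, the rounding error cannot exceed half a step, i.e. scale / 2.

```python
def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Recover an approximate float: x ≈ (q - zero_point) * scale."""
    return (q - zero_point) * scale


scale, zero_point = 5.6 / 255, 114

x = 0.5
q = max(0, min(255, round(x / scale + zero_point)))  # quantize (Step 3)
x_hat = dequantize(q, scale, zero_point)             # dequantize (Step 4)
error = abs(x - x_hat)

print(f"q = {q}, x_hat = {x_hat:.4f}, error = {error:.4f}")
print(f"worst-case rounding error (scale / 2) = {scale / 2:.4f}")
```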

Quick check

What does the 'zero-point' represent in quantization?

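Symmetric: The range is mirrored around zero: [-A, A], where A is the largest absolute weight. The zero-point is fixed at the middle of the integer range (128 for unsigned INT8, 0 for signed INT8). Simple, but it wastes integer levels when the weights are skewed to one side.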
Asymmetric: Uses the full [min, max] range with a calculated zero-point. More flexible and often achieves better accuracy because it uses all 256 integer levels efficiently.
Rule of thumb: Use asymmetric quantization unless you have a specific reason not to (e.g., special hardware support for symmetric).
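To see why skewed data favors the asymmetric scheme, the following sketch quantizes a deliberately one-sided set of values both ways and compares the mean absolute round-trip error. The data and helper names are made up for illustration, and the symmetric variant uses the signed convention (zero-point 0, levels -127 to 127), which is the same idea as the unsigned zero-point of 128.

```python
def asymmetric_params(values, num_bits=8):
    """Full [min, max] range mapped onto [0, 2**num_bits - 1]."""
    lo, hi = min(values), max(values)
    q_max = 2 ** num_bits - 1
    scale = (hi - lo) / q_max
    return scale, round(-lo / scale), 0, q_max


def symmetric_params(values, num_bits=8):
    """Mirrored range [-A, A] mapped onto signed levels with zero-point 0."""
    a = max(abs(v) for v in values)
    q_max = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    return a / q_max, 0, -q_max, q_max


def roundtrip_mae(values, scale, zero_point, q_lo, q_hi):
    """Mean absolute error after quantizing and dequantizing every value."""
    total = 0.0
    for v in values:
        q = max(q_lo, min(q_hi, round(v / scale + zero_point)))
        total += abs(v - (q - zero_point) * scale)
    return total / len(values)


# Skewed toy data: values from 0.00 to 6.00 plus a single small negative value.
values = [i / 100 for i in range(601)] + [-0.5]

mae_asym = roundtrip_mae(values, *asymmetric_params(values))
mae_sym = roundtrip_mae(values, *symmetric_params(values))
print(f"asymmetric MAE: {mae_asym:.5f}")
print(f"symmetric  MAE: {mae_sym:.5f}")
```

On this data the symmetric range spends almost half of its levels on negative values that never occur, so its step size, and therefore its error, is larger.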

See it live: Number line quantization

Pick a float value and watch it snap to the nearest integer quantization level. See the error (the distance it moves) in real time.

Linear Quantization Visualization

[Interactive figure: weights on a Float32 number line spanning -2.69 to 2.20. Click a weight to analyze it.]

Quantization Parameters

Scale: 0.021124
Zero Point: 128
Range: [-2.69, 2.20]
Mean Absolute Error: 0.005532

How it works:

Each FP32 value is mapped to an Int8 value using a scale factor and zero point. The dequantized value is the closest approximation, and the error is what we sacrifice for smaller model size.


Quick check

If the quantization scale is 0.05 and zero-point is 100, what integer does 5.1 quantize to?

Weight distribution: Before and after

Compare how weights are distributed in the original model vs. after quantization. Lower bit widths cause "clumping" as multiple float values map to the same integer.

Weight Distribution Before & After Quantization

[Interactive figure: two histograms over the weight range -3.30 to 3.44, "Original (FP32)" and "After Quantization (Int8)".]

MAE: 0.0066
RMSE: 0.0077
Within 5% Error: 80.0%

Metric   Original   Dequantized   Error
Min      -3.303     -3.304        0.001
Mean     -0.041     -0.041        0.000
Max       3.437      3.436        0.001

Understanding the histograms:

The blue histogram (left) shows how weights are distributed in the original model. The green histogram (right) shows the distribution after quantization and dequantization. Lower-bit widths cause more "clumping" as many values map to the same quantized level.
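The MAE, RMSE, and "within 5%" figures reported for this comparison can be computed from a quantize/dequantize round trip. The sketch below is an assumption about how such metrics are defined (in particular, "within 5%" is taken to mean a relative error of at most 5% of the original value), and it uses randomly generated weights rather than the data behind the figure, so its numbers will differ.

```python
import math
import random


def roundtrip(weights, num_bits=8):
    """Quantize with a min/max affine scheme, then dequantize back to floats."""
    lo, hi = min(weights), max(weights)
    q_max = 2 ** num_bits - 1
    scale = (hi - lo) / q_max
    zp = round(-lo / scale)
    return [(max(0, min(q_max, round(w / scale + zp))) - zp) * scale for w in weights]


def error_metrics(original, restored, rel_tol=0.05):
    """Return (MAE, RMSE, fraction of values with relative error <= rel_tol)."""
    abs_err = [abs(o - r) for o, r in zip(original, restored)]
    mae = sum(abs_err) / len(abs_err)
    rmse = math.sqrt(sum(e * e for e in abs_err) / len(abs_err))
    within = sum(e <= rel_tol * abs(o) for e, o in zip(abs_err, original)) / len(original)
    return mae, rmse, within


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # synthetic weights
mae, rmse, within = error_metrics(weights, roundtrip(weights))
print(f"MAE = {mae:.4f}, RMSE = {rmse:.4f}, within 5% = {within:.1%}")
```

Values very close to zero rarely pass a relative-error test, which is one reason the "within 5%" share stays well below 100% even when MAE is tiny.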


Quick check

In an INT4 quantized model, how many unique values can a parameter have?

With INT8, you have 256 possible values. With INT4, you only have 16. Imagine fitting a bell curve (your weight distribution) into 16 bins instead of 256. Many weights that were slightly different now map to the same integer. This is where the accuracy loss comes from, but it's often acceptable because the original differences were small and had little influence on the model's output.
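The clumping effect is easy to observe by counting how many distinct integer levels a set of weights actually occupies. This sketch uses synthetic bell-curve weights and a min/max affine scheme; both are assumptions made for illustration.

```python
import random


def distinct_levels(weights, num_bits):
    """Number of different integer codes used after min/max affine quantization."""
    lo, hi = min(weights), max(weights)
    q_max = 2 ** num_bits - 1
    scale = (hi - lo) / q_max
    zp = round(-lo / scale)
    return len({max(0, min(q_max, round(w / scale + zp))) for w in weights})


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # synthetic bell curve

for bits in (8, 4):
    used = distinct_levels(weights, bits)
    print(f"INT{bits}: {2 ** bits} levels available, {used} used by 10,000 weights")
```

At 4 bits, ten thousand different floats collapse onto at most 16 codes, so each code stands in for hundreds of originally distinct weights.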

Real impact: Model size and speed

See how quantization translates to practical memory and speed gains. Compare across different model sizes and precisions.

Model Size & Memory Comparison

Parameters: 7B
FP32 (32-bit): 26,702.88 MB (~26.1 GB)
FP16 (16-bit): 13,351.44 MB (~13.0 GB)
INT8 (8-bit): 6,675.72 MB (~6.5 GB)
INT4 (4-bit): 3,337.86 MB (~3.3 GB)

FP32 → INT8 Savings: 20,027.2 MB (~19.6 GB), a 75% reduction; up to 2-4x faster

FP32 → INT4 Savings: 23,365.0 MB (~22.8 GB), an 87.5% reduction; up to 4-8x faster

Precision   Bit Width   Size (MB)    vs FP32      Use Case
FP32        32          26,702.88    1x           Training, Research
FP16        16          13,351.44    2x smaller   GPU inference, Mixed precision
INT8        8            6,675.72    4x smaller   Mobile, Edge devices
INT4        4            3,337.86    8x smaller   Extreme compression

Real-world impact:

  • INT8 is 4x smaller than FP32, so 4x as many models fit on the same device
  • At INT4, a 7B model (~3.3 GB) fits in the memory of a typical consumer GPU
  • Smaller models = faster inference = lower latency for end-users

Quick check

Which precision is typically used for production inference on consumer GPUs?

Memory savings alone would make quantization valuable. But hardware accelerators also run INT8 and INT4 math faster than FP32 math. Why?

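  1. Narrower operands need fewer transistors per operation, which means lower power and less heat.
  2. Integer arithmetic is simpler than floating-point arithmetic.
  3. Many GPUs have specialized units (such as tensor cores) whose INT8 throughput is much higher than their FP32 throughput, by roughly 4-8x per clock on some hardware.

Result: INT8 models can run 2-4x faster in practice. INT4 can reach a 4-8x speedup, though with more accuracy loss. The actual gain depends on the hardware, the kernels, and whether the workload is limited by compute or by memory bandwidth; for LLM inference, moving fewer bytes per weight is often the larger win.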

Beyond linear quantization: GPTQ & AWQ

Simple linear quantization works well, but advanced methods squeeze out extra accuracy, especially at 4-bit widths.

GPTQ (Hessian-based)

Quantizes the model one layer at a time, using approximate Hessian information (second-order derivatives) to adjust the weights that are not yet quantized so they compensate for the rounding error already introduced.

+ Excellent accuracy at 4-bit
+ Layer-by-layer quantization
- Slow calibration process

Install: pip install auto-gptq

AWQ (Activation-Aware)

Analyzes activation distributions to identify which weights have the largest impact on the model's output.

+ Fast calibration
+ Great for 4-bit quantization
- Slightly lower accuracy than GPTQ on some models

Install: pip install autoawq

Quick check

What is the main idea behind AWQ (Activation-Aware Quantization)?

Quantization quality depends heavily on choosing the right quantization range. You don't want to include extreme outliers (which would waste bits) or clip important values (which would lose accuracy).

Calibration means running the model on a representative dataset (usually 50-500 samples) and collecting statistics about weights and activations. Common calibration methods:

  • Min-Max: Use the actual min/max. Simple but sensitive to outliers.
  • Percentile (1%-99%): Ignore extreme outliers. Better for real data.
  • Entropy/KL-Divergence: Find the range that minimizes the information lost between the original and quantized distributions. Often the most accurate, but the most expensive to compute.
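The difference between the first two calibration methods shows up as soon as the data contains an outlier. The sketch below compares the resulting INT8 step size for min-max and for 1%-99% percentile calibration on synthetic activations with one injected outlier. The helper names and the nearest-rank percentile are illustrative choices, and entropy/KL calibration is not shown.

```python
import random


def minmax_range(values):
    """Min-max calibration: use the observed extremes as the quantization range."""
    return min(values), max(values)


def percentile_range(values, lower_pct=1.0, upper_pct=99.0):
    """Percentile calibration (nearest rank): ignore the tails outside the bounds."""
    ordered = sorted(values)
    n = len(ordered)
    lo_idx = round(lower_pct / 100 * (n - 1))
    hi_idx = round(upper_pct / 100 * (n - 1))
    return ordered[lo_idx], ordered[hi_idx]


def step_size(lo, hi, num_bits=8):
    """Float width of one integer level for the given range."""
    return (hi - lo) / (2 ** num_bits - 1)


random.seed(0)
# Synthetic calibration data: a bell curve plus one extreme outlier.
activations = [random.gauss(0.0, 1.0) for _ in range(5_000)] + [40.0]

mm_lo, mm_hi = minmax_range(activations)
pc_lo, pc_hi = percentile_range(activations)
print(f"min-max:    [{mm_lo:.2f}, {mm_hi:.2f}], step = {step_size(mm_lo, mm_hi):.4f}")
print(f"percentile: [{pc_lo:.2f}, {pc_hi:.2f}], step = {step_size(pc_lo, pc_hi):.4f}")
```

The single outlier stretches the min-max range and makes every step coarser, while the percentile range keeps fine steps for the bulk of the data at the cost of clipping the tails.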

Step-by-step walkthrough

Go through all the key concepts in order with our interactive guided lesson.

Guided Quantization Walkthrough

The Core Problem

Modern language models have billions of parameters, each stored as 32-bit floats. A 7B parameter model takes ~26 GB of VRAM just to store weights. This is expensive and slow.

Without Quantization

  • LLaMA-7B: 26 GB
  • Requires high-end GPU
  • Slow inference
  • Limited deployment options

With INT4 Quantization

  • LLaMA-7B: ~3.25 GB
  • Runs on consumer GPU
  • Up to 4-8x faster inference
  • Feasible for production

The Trade-off

Quantization sacrifices a small amount of model accuracy in exchange for large gains in speed, memory, and deployment feasibility. In practice, well-quantized models often lose <1% accuracy while being 4x smaller at INT8 and 8x smaller at INT4.


Interactive playground

Experiment with different quantization methods and bit widths. See how accuracy and model size trade off in real time.

Interactive Quantization Playground

[Bit-width slider: 2-bit (extreme) to 32-bit (original)]

Estimated Model Size

FP32 (original): 26,702.9 MB (~26.1 GB)
At INT8: 6,675.72 MB (~6.5 GB)
Space saved: 75%

[Interactive chart: Accuracy Across Bit Widths. x-axis: Bit Width; y-axis: Accuracy (0% to 100%).]

MAE (mean absolute error): 0.0073
RMSE (root mean square error): 0.0084
Within 5% (values within 5% error): 78.7%
Compression (vs FP32 baseline): 4x

Detailed Analysis

Quantization method: linear
Bit width: 8
Scale factor: 0.029014
Zero point: 130
Max error: 0.014506
Model size (approx): 6,675.72 MB (~6.5 GB)

Playground tips:

  • Drag the bit width slider to see how error changes
  • Try different quantization methods to compare accuracy
  • Higher bit widths = better accuracy but larger model size
  • INT8 is the standard choice for most real-world deployments

Quick check

You have a 70B parameter model and want to run it on a device with 16 GB of VRAM. Which quantization is most likely to fit?

Key takeaways

4-8x compression

INT8 quantization shrinks FP32 models by 75% (4x) and INT4 by 87.5% (8x), with minimal accuracy loss.

4-8x speedup

Hardware can execute INT8 and INT4 operations much faster than FP32.

Consumer GPUs

Quantized models fit on affordable hardware, democratizing AI.

Calibration matters

Entropy-based calibration on representative data often gives the best results.

Advanced methods

GPTQ and AWQ achieve better accuracy at extreme bit widths (2-4 bit).

Trade-off is real

Lower bits = smaller model but more accuracy loss. Choose based on constraints.


Quick check

What is the single most important factor in quantization quality?

1-bit quantization (Binarization): Taking quantization to the extreme. Some research models have been binarized, but practical deployment is limited due to severe accuracy loss.

Mixed-precision: Don't quantize all layers equally. Use FP32 for critical layers (embeddings, output), INT8 or INT4 for the bulk of the model.

Grouped quantization: Instead of one quantization range per layer, use different ranges for different groups of weights. More flexible, more overhead, but better accuracy.

Dynamic quantization: Quantization parameters adapt during inference based on activation statistics. Complex but can achieve state-of-the-art accuracy at low bit widths.
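Of the techniques above, grouped quantization is the simplest to illustrate. The sketch below compares one shared 4-bit range for an entire toy weight vector against a separate range per group of 64 weights. The data is synthetic (two blocks with very different magnitudes), the helper names are hypothetical, and real implementations also have to store one scale and zero-point per group, which is the overhead mentioned above.

```python
def quantize_group(group, num_bits=4):
    """Min/max affine quantization of one group; returns (codes, scale, zero_point)."""
    lo, hi = min(group), max(group)
    q_max = 2 ** num_bits - 1
    scale = (hi - lo) / q_max if hi > lo else 1.0  # guard against a constant group
    zp = round(-lo / scale)
    codes = [max(0, min(q_max, round(w / scale + zp))) for w in group]
    return codes, scale, zp


def grouped_roundtrip(weights, group_size, num_bits=4):
    """Quantize each group with its own range, then dequantize back to floats."""
    restored = []
    for start in range(0, len(weights), group_size):
        codes, scale, zp = quantize_group(weights[start:start + group_size], num_bits)
        restored.extend((q - zp) * scale for q in codes)
    return restored


def mae(a, b):
    """Mean absolute difference between two equally long sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Toy weights: a block of tiny values followed by a block of large values.
weights = [i / 1000 for i in range(-32, 32)] + [i / 10 for i in range(-32, 32)]

per_tensor = grouped_roundtrip(weights, group_size=len(weights))  # one shared range
per_group = grouped_roundtrip(weights, group_size=64)             # one range per block

print(f"single range MAE: {mae(weights, per_tensor):.4f}")
print(f"grouped MAE:      {mae(weights, per_group):.4f}")
```

With a single range, every small weight collapses onto the same code because the step size is set by the large block; giving each group its own range restores resolution for the small block.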

Want to dive deeper?

Check out these resources to learn more about quantization, model compression, and efficient inference:

  • AutoGPTQ: A widely used implementation of GPTQ quantization. Works with most Hugging Face Transformers models.
  • AutoAWQ: Fast and accurate AWQ quantization with optimized kernels.
  • bitsandbytes: Library for 8-bit and 4-bit quantization, commonly used through Hugging Face Transformers for inference and for QLoRA-style fine-tuning.
  • Ollama: Run quantized models locally. Great for experimenting with different precisions.
