Interactive | ~15 min | Intermediate

Quantization: Compress models 8x without losing accuracy

Learn why modern AI models are quantized, how 32-bit floats become 4-bit integers, and why your phone can run LLMs that used to require $10,000 GPUs. With live interactive visualizations and a hands-on playground.


Why quantization matters

Modern language models like GPT-4, Claude, and LLaMA have billions of parameters. Each parameter is typically stored as a 32-bit floating-point number, consuming massive amounts of memory and making inference slow.

The challenge: A 7-billion parameter model needs ~26 GB of VRAM just to store weights. That's beyond what most consumer GPUs can handle. Production servers must spread the model across multiple GPUs, adding latency and cost.

The solution: Quantization reduces each 32-bit float to 8 bits (or even 4 bits). The same 7B model shrinks to about 6.5 GB at 8 bits or about 3.25 GB at 4 bits, small enough for consumer hardware, often while losing less than 1% accuracy.

Why it matters in practice

Quantization made it possible to run state-of-the-art models on phones, edge devices, and consumer GPUs. It's the difference between a $50k inference cluster and a MacBook.

FP32 (unquantized): 26 GB

INT8 quantized: 6.5 GB

INT4 quantized: 3.25 GB

Memory savings for LLaMA-7B:

• FP32 → INT8: 4x smaller
• FP32 → INT4: 8x smaller
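Weight memory is simply the parameter count times the bits per parameter. The following is a minimal sketch of that arithmetic, assuming weights only (no activations, KV cache, or per-group scale overhead) and binary gigabytes (1 GB = 1024^3 bytes), which is why 7B FP32 parameters land near 26 GB. The function name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1024**3 bytes)."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


params = 7e9  # assumed round figure for a "7B" model
fp32_size = weight_memory_gb(params, 32)

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    size = weight_memory_gb(params, bits)
    # The ratio to FP32 is just 32 / bits: 1x, 2x, 4x, 8x.
    print(f"{name}: {size:6.2f} GB ({fp32_size / size:.0f}x smaller than FP32)")
```

Real quantized checkpoints are usually slightly larger than this estimate because they also store scales and zero-points.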

Quick check

A model quantized from FP32 to INT8 uses how much memory compared to the original?

You might think throwing away 24 of 32 bits would destroy a model. It doesn't, for three reasons:

Redundancy: Neural network weights carry a lot of redundancy. Many parameters can be rounded to a nearby shared value without changing the model's behavior much.
Insensitive ranges: Most weights cluster in a narrow band around zero, and values outside the calibrated quantization range are rare. Calibration picks a range that covers the bulk, so the quantizer spends its precision where most of the weights are.
Activation awareness: Advanced methods like AWQ use activation statistics to estimate how much each weight affects the model's output, then protect the most influential weights (for example by rescaling them before quantization) so that rounding error lands mostly on weights that matter less.

The mechanics: From float to integer

Step 1: Pick a range

Examine all model weights and find their min and max values. For example, weights might range from -2.5 to 3.1.

weights: [-2.5, -0.3, 0.1, 0.5, 1.2, 3.1, ...]
min = -2.5, max = 3.1, range = 5.6

Step 2: Calculate scale and zero-point

Map the float range to the integer range. For INT8, that's [0, 255] for unsigned integers or [-128, 127] for signed integers. This example uses [0, 255].

scale = (max - min) / 255 = 5.6 / 255 ≈ 0.022

zero_point = round(-min / scale) ≈ 114
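The same calculation can be sketched in a few lines of Python. This is an illustration under the assumptions of the example above (unsigned 8-bit range [0, 255], per-tensor min/max); the helper name `affine_params` is hypothetical.

```python
def affine_params(w_min: float, w_max: float, num_bits: int = 8):
    """Return (scale, zero_point) mapping [w_min, w_max] onto [0, 2**num_bits - 1]."""
    q_max = 2**num_bits - 1            # 255 for 8 bits
    scale = (w_max - w_min) / q_max    # float width of one integer step
    zero_point = round(-w_min / scale) # integer that represents the float 0.0
    return scale, zero_point


weights = [-2.5, -0.3, 0.1, 0.5, 1.2, 3.1]
scale, zero_point = affine_params(min(weights), max(weights))
print(f"scale = {scale:.4f}, zero_point = {zero_point}")
# prints: scale = 0.0220, zero_point = 114
```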

Step 3: Quantize each value

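Map each float to an integer using the formula q = round(x / scale + zero_point), then clamp the result to the integer range ([0, 255] here) so that out-of-range values saturate instead of overflowing.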

-2.5 → round(-2.5 / 0.022 + 114) = 0

0.5 → round(0.5 / 0.022 + 114) = 137

3.1 → round(3.1 / 0.022 + 114) = 255
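As a sketch, the quantization step looks like this in Python. It reuses the rounded scale (0.022) and zero-point (114) from step 2 and assumes an unsigned 8-bit target. The clamp is what handles values that fall outside the calibrated range.

```python
def quantize(x: float, scale: float, zero_point: int, num_bits: int = 8) -> int:
    """Map a float to an unsigned integer level, saturating at the range ends."""
    q = round(x / scale + zero_point)
    return max(0, min(2**num_bits - 1, q))


scale, zero_point = 0.022, 114  # rounded values from step 2

for x in [-2.5, 0.5, 3.1]:
    print(f"{x:5.2f} -> {quantize(x, scale, zero_point)}")
# -2.50 -> 0
#  0.50 -> 137
#  3.10 -> 255

# A value beyond the calibrated range saturates at the top level.
print(quantize(10.0, scale, zero_point))  # 255
```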

Step 4: Inference

During inference, dequantize INT8 values back to floats: x ≈ (q - zero_point) * scale

137 → (137 - 114) * 0.022 ≈ 0.506 (original: 0.5)

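Error: 0.006, about 1% of the original value. For any value inside the calibrated range, the rounding error is at most half a step (scale / 2 ≈ 0.011).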
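A quick way to check the error is to quantize and dequantize every example weight. The sketch below assumes the rounded parameters from the walkthrough (scale 0.022, zero-point 114) and an unsigned 8-bit range.

```python
def quantize(x: float, scale: float, zero_point: int, num_bits: int = 8) -> int:
    q = round(x / scale + zero_point)
    return max(0, min(2**num_bits - 1, q))  # clamp to [0, 255]


def dequantize(q: int, scale: float, zero_point: int) -> float:
    return (q - zero_point) * scale


scale, zero_point = 0.022, 114
weights = [-2.5, -0.3, 0.1, 0.5, 1.2, 3.1]

errors = []
for w in weights:
    q = quantize(w, scale, zero_point)
    w_hat = dequantize(q, scale, zero_point)
    errors.append(abs(w - w_hat))
    print(f"{w:5.2f} -> q={q:3d} -> {w_hat:6.3f} (error {errors[-1]:.3f})")

max_error = max(errors)
# For in-range values the error cannot exceed half a quantization step.
print(f"max error {max_error:.3f} vs scale/2 = {scale / 2:.3f}")
```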

Quick check

What does the 'zero-point' represent in quantization?

Symmetric: The range is mirrored around zero: [-A, A]. The zero-point is fixed rather than calculated: 0 for signed INT8, or 128 (the middle) for unsigned INT8. Simple, but it can waste precision if the range is unbalanced.

Asymmetric: Uses the full [min, max] range with a calculated zero-point. More flexible and often achieves better accuracy because it uses all 256 integer levels efficiently.

Rule of thumb: Use asymmetric quantization unless you have a specific reason not to (e.g., special hardware support for symmetric).
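To see how much precision the two schemes spend, compare their step sizes on a balanced and a one-sided set of values. This is a sketch, assuming the signed convention for symmetric quantization (levels in [-127, 127], zero-point 0) and the unsigned convention for asymmetric quantization (levels in [0, 255]). Both helper names are hypothetical.

```python
def symmetric_params(values, num_bits: int = 8):
    """Scale for a mirrored range [-A, A]; the zero-point is fixed at 0."""
    q_max = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    scale = max(abs(v) for v in values) / q_max
    return scale, 0


def asymmetric_params(values, num_bits: int = 8):
    """Scale and zero-point for the actual [min, max] range."""
    q_max = 2**num_bits - 1                  # 255 for 8 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / q_max
    return scale, round(-lo / scale)


balanced = [-2.5, -0.3, 0.1, 0.5, 1.2, 3.1]
one_sided = [0.0, 0.4, 1.3, 2.2, 6.0]  # e.g. values that are never negative

for name, values in [("balanced", balanced), ("one-sided", one_sided)]:
    sym_scale, sym_zp = symmetric_params(values)
    asym_scale, asym_zp = asymmetric_params(values)
    print(f"{name}: symmetric step {sym_scale:.4f} (zp {sym_zp}), "
          f"asymmetric step {asym_scale:.4f} (zp {asym_zp})")
```

On the one-sided data the symmetric step is about twice as large, because half of its integer levels cover negative values that never occur.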

See it live: Number line quantization

Pick a float value and watch it snap to the nearest integer quantization level. See the error (the distance it moves) in real time.

Linear Quantization Visualization (interactive: bit width 8, symmetric, Float32 range [-1.89, 3.35])

Quantization Parameters

Scale: 0.026299
Zero Point: 128
Range: [-1.89, 3.35]
Mean Absolute Error: 0.006323

How it works:

Each FP32 value is mapped to an Int8 value using a scale factor and zero point. The dequantized value is the closest approximation, and the error is what we sacrifice for smaller model size.

Quick check

If the quantization scale is 0.05 and zero-point is 100, what integer does 5.1 quantize to?

Weight distribution: Before and after

Compare how weights are distributed in the original model vs. after quantization. Lower bit widths cause "clumping" as multiple float values map to the same integer.

Weight Distribution Before & After Quantization (interactive: bit width 8, 32 histogram bins, weight range [-3.67, 3.76])

MAE: 0.0072
RMSE: 0.0084
Within 5% Error: 78.1%

Metric	Original	Dequantized	Error
Min	-3.668	-3.671	0.002
Mean	-0.008	-0.008	0.000
Max	3.761	3.758	0.002

Understanding the histograms:

The blue histogram (left) shows how weights are distributed in the original model. The green histogram (right) shows the distribution after quantization and dequantization. Lower-bit widths cause more "clumping" as many values map to the same quantized level.

Quick check

In an INT4 quantized model, how many unique values can a parameter have?

With INT8, you have 256 possible values. With INT4, you only have 16. Imagine fitting a bell curve (your weight distribution) into 16 bins instead of 256. Many weights that were slightly different now map to the same integer. This is the accuracy loss, but it's often acceptable because the original differences were small and influenced the model's output very little.
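The clumping effect is easy to reproduce. The sketch below draws 1,000 synthetic weights from a normal distribution (an assumption; real weight distributions vary by layer), quantizes them with min/max linear quantization, and counts how many distinct integer levels are actually used at each bit width.

```python
import random


def quantize_levels(weights, num_bits: int):
    """Min/max linear quantization to unsigned integers of the given width."""
    q_max = 2**num_bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / q_max
    zero_point = round(-lo / scale)
    return [max(0, min(q_max, round(w / scale + zero_point))) for w in weights]


rng = random.Random(0)  # fixed seed so the run is repeatable
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]

levels_used = {}
for bits in (8, 4, 2):
    levels_used[bits] = len(set(quantize_levels(weights, bits)))
    print(f"{bits}-bit: {levels_used[bits]} distinct levels used "
          f"out of {2**bits} available, for {len(weights)} weights")
```

At 4 bits, all 1,000 weights have to share at most 16 levels, so many slightly different floats collapse onto the same integer.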

Real impact: Model size and speed

See how quantization translates to practical memory and speed gains. Compare across different model sizes and precisions.

Model Size & Memory Comparison (interactive: select a model size; values shown for 7B parameters)

FP32 (32-bit): 26.08 GB
FP16 (16-bit): 13.04 GB
INT8 (8-bit): 6.52 GB
INT4 (4-bit): 3.26 GB

FP32 → INT8 savings: 19.56 GB (75% reduction, 2-4x faster)
FP32 → INT4 savings: 22.82 GB (87.5% reduction, 4-8x faster)

Precision	Bit Width	Size (GB)	Compression vs FP32	Use Case
FP32	32	26.08	1x	Training, Research
FP16	16	13.04	2x	GPU inference, Mixed precision
INT8	8	6.52	4x	Mobile, Edge devices
INT4	4	3.26	8x	Extreme compression

Real-world impact:

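• INT8 weights take a quarter of the space of FP32, so roughly 4x as many models fit on the same device
• At INT4, a 7B model (about 3.3 GB of weights) fits in the memory of a typical consumer GPU
• Smaller models mean less data to move from memory, which usually means faster inference and lower latency for end users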

Quick check

Which precision is typically used for production inference on consumer GPUs?

Memory savings alone would make quantization valuable. But hardware accelerators also run INT8 and INT4 math faster than FP32 math. Why?

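Narrower arithmetic units need fewer transistors, draw less power, and produce less heat.
Integer arithmetic is simpler than floating-point arithmetic.
Many GPUs have specialized INT8 units (tensor cores) that can do up to 4-8x more INT8 operations per clock than FP32 operations.
LLM inference is often limited by memory bandwidth, so reading 4x or 8x fewer bytes per weight speeds up generation by itself.

Result: INT8 models can run 2-4x faster in practice. INT4 can reach a 4-8x speedup, though with more accuracy loss. Actual gains depend on the hardware and on kernel support for the chosen format.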

Beyond linear quantization: GPTQ & AWQ

Simple linear quantization works well, but advanced methods squeeze out extra accuracy, especially at 4-bit widths.

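GPTQ (Hessian-based)

Uses approximate second-order (Hessian) information, computed from calibration data, to quantize the weights of each layer step by step. After each rounding step it adjusts the not-yet-quantized weights to compensate for the error just introduced.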

+Excellent accuracy at 4-bit

+Layer-by-layer quantization

−Slow calibration process

pip install auto-gptq

AWQ (Activation-Aware)

Analyzes activation distributions to identify which weights have the largest impact on the model's output.

+Fast calibration

+Great for 4-bit quantization

−Slightly lower accuracy than GPTQ on some models

pip install autoawq
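The core idea of activation-aware scaling can be shown on a single dot product. The toy below is not the AutoAWQ API and not the full AWQ algorithm; it is a sketch that assumes one weight row, one "salient" input channel with large activations, 4-bit symmetric rounding, and a plain grid search over the scaling factor `s`. Multiplying a weight by `s` and dividing its activation by `s` leaves the exact result unchanged but changes where the rounding error falls.

```python
def fake_quant(weights, num_bits: int = 4):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    q_max = 2 ** (num_bits - 1) - 1
    scale = max(abs(w) for w in weights) / q_max
    return [round(w / scale) * scale for w in weights]


def output_error(weights, activations, s: float) -> float:
    """Dot-product error when channel 0 is scaled by s before quantization."""
    scaled_w = [weights[0] * s] + weights[1:]
    scaled_x = [activations[0] / s] + activations[1:]
    exact = sum(w * x for w, x in zip(weights, activations))
    approx = sum(w * x for w, x in zip(fake_quant(scaled_w), scaled_x))
    return abs(exact - approx)


weights = [0.11, -0.52, 0.93, 0.27]
activations = [10.0, 0.1, 0.1, 0.1]  # channel 0 dominates the output

# Candidate scales 1.0, 1.1, ..., 4.0; s = 1.0 means "no scaling".
candidates = [1.0 + 0.1 * i for i in range(31)]
plain_error = output_error(weights, activations, 1.0)
best_s = min(candidates, key=lambda s: output_error(weights, activations, s))
best_error = output_error(weights, activations, best_s)

print(f"no scaling: error {plain_error:.4f}")
print(f"best s = {best_s:.1f}: error {best_error:.4f}")
```

Because `s = 1.0` is among the candidates, the searched result can never be worse than plain rounding on this data. Real implementations search per channel over calibration batches rather than a single vector.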

Quick check

What is the main idea behind AWQ (Activation-Aware Quantization)?

Quantization quality depends heavily on choosing the right quantization range. You don't want to include extreme outliers (which would waste bits) or clip important values (which would lose accuracy).

Calibration means running the model on a representative dataset (usually 50-500 samples) and collecting statistics about weights and activations. Common calibration methods:

Min-Max: Use the actual min/max. Simple but sensitive to outliers.
Percentile (1%-99%): Ignore extreme outliers. Better for real data.
Entropy/KL-Divergence: Find the range that minimizes information loss between the original and quantized distributions. Often the most accurate, but slower to compute.
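The difference between the first two methods shows up as soon as the data contains an outlier. The sketch below uses synthetic values (1,001 points spread evenly over [-1, 1] plus one outlier at 20.0, all assumptions for illustration) and compares the median round-trip error under a min-max range and a 1%-99% percentile range.

```python
import statistics


def minmax_range(values):
    return min(values), max(values)


def percentile_range(values, lower: float = 0.01, upper: float = 0.99):
    """Simple index-based percentiles; values outside the range get clipped."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


def roundtrip(x: float, lo: float, hi: float, num_bits: int = 8) -> float:
    """Quantize x to the given range (with clamping), then dequantize."""
    q_max = 2**num_bits - 1
    scale = (hi - lo) / q_max
    zero_point = round(-lo / scale)
    q = max(0, min(q_max, round(x / scale + zero_point)))
    return (q - zero_point) * scale


def median_abs_error(values, lo: float, hi: float) -> float:
    return statistics.median(abs(v - roundtrip(v, lo, hi)) for v in values)


values = [i / 500 - 1 for i in range(1001)] + [20.0]

for name, (lo, hi) in [("min-max", minmax_range(values)),
                       ("percentile", percentile_range(values))]:
    print(f"{name}: range [{lo:.2f}, {hi:.2f}], "
          f"median error {median_abs_error(values, lo, hi):.4f}, "
          f"error on outlier {abs(20.0 - roundtrip(20.0, lo, hi)):.2f}")
```

The percentile range gives typical values a much finer grid, at the price of a large clipping error on the outlier. Whether that trade is acceptable depends on how much the outlier matters to the model.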

Step-by-step walkthrough

Go through all the key concepts in order with our interactive guided lesson.

Guided Quantization Walkthrough (steps: Why, Number, Linear, Calibration, Advanced, Quality)

The Core Problem

Modern language models have billions of parameters, each stored as 32-bit floats. A 7B parameter model takes ~26 GB of VRAM just to store weights. This is expensive and slow.

Without Quantization

• LLaMA-7B: 26 GB
• Requires high-end GPU
• Slow inference
• Limited deployment options

With INT4 Quantization

• LLaMA-7B: ~3.25 GB
• Runs on consumer GPU
• 4-8x faster inference
• Feasible for production

The Trade-off

Quantization sacrifices a small amount of model accuracy in exchange for large gains in speed, memory, and deployment feasibility. In practice, well-quantized models often lose less than 1% accuracy while being 4x (INT8) to 8x (INT4) smaller than FP32.


Interactive playground

Experiment with different quantization methods and bit widths. See how accuracy and model size trade off in real time.

Interactive Quantization Playground (interactive: bit width slider from 2-bit (extreme) to 32-bit (original); values shown for 8-bit symmetric quantization of a 7B model)

Estimated Model Size

FP32 (original): 26.08 GB
At INT8: 6.52 GB
Space saved: 75%

Accuracy Across Bit Widths

MAE (mean absolute error): 0.0065
RMSE (root mean square error): 0.0075
Within 5% (values within 5% error): 80.8%
Compression (vs FP32 baseline): 4x

Detailed Analysis

Quantization method	linear
Bit width	8
Scale factor	0.026437
Zero point	131
Max error	0.013210
Model size (approx)	6.52 GB

Playground tips:

• Drag the bit width slider to see how error changes
• Try different quantization methods to compare accuracy
• Higher bit widths = better accuracy but larger model size
• INT8 is the standard choice for most real-world deployments

Quick check

You have a 70B parameter model and want to run it on a device with 16 GB of VRAM. Which quantization is most likely to fit?

Key takeaways

4-8x compression

INT8 quantization shrinks FP32 models by 75% (4x) with minimal accuracy loss. INT4 shrinks them by 87.5% (8x).

2-8x speedup

Hardware can execute INT8 and INT4 operations much faster than FP32: roughly 2-4x for INT8 and 4-8x for INT4.

Consumer GPUs

Quantized models fit on affordable hardware, democratizing AI.

Calibration matters

Entropy-based calibration on representative data often gives the best results.

Advanced methods

GPTQ and AWQ achieve better accuracy at extreme bit widths (2-4 bit).

Trade-off is real

Lower bits = smaller model but more accuracy loss. Choose based on constraints.

Quick check

What is the single most important factor in quantization quality?

1-bit quantization (Binarization): Taking quantization to the extreme. Some research models have been binarized, but practical deployment is limited due to severe accuracy loss.

Mixed-precision: Don't quantize all layers equally. Keep critical layers (embeddings, output) in higher precision such as FP16 or FP32, and use INT8 or INT4 for the bulk of the model.

Grouped quantization: Instead of one quantization range per layer, use different ranges for different groups of weights. More flexible, more overhead, but better accuracy.

Dynamic quantization: Quantization parameters for activations are computed during inference from the values actually observed, instead of being fixed in advance by calibration. This adds runtime work but adapts to each input and can improve accuracy at low bit widths.
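Grouped quantization is the easiest of these to illustrate. The sketch below assumes 4-bit symmetric rounding and a made-up tensor of eight weights in which the first four are about 100x smaller than the last four. With one scale for the whole tensor, the small weights all round to zero. With one scale per group of four, they keep their own resolution.

```python
def fake_quant(weights, num_bits: int = 4):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    q_max = 2 ** (num_bits - 1) - 1
    scale = max(abs(w) for w in weights) / q_max
    return [round(w / scale) * scale for w in weights]


def grouped_fake_quant(weights, group_size: int, num_bits: int = 4):
    """Quantize each consecutive group of weights with its own scale."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(fake_quant(weights[start:start + group_size], num_bits))
    return out


def mean_abs_error(original, approx) -> float:
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


# Hypothetical weights: one group of tiny values, one group of large values.
weights = [0.010, -0.020, 0.015, -0.005, 0.9, -1.2, 0.4, -0.7]

tensor_mae = mean_abs_error(weights, fake_quant(weights))
grouped_mae = mean_abs_error(weights, grouped_fake_quant(weights, group_size=4))

print(f"one scale per tensor:  MAE {tensor_mae:.4f}")
print(f"one scale per group:   MAE {grouped_mae:.4f}")
```

The overhead mentioned above is the extra scale stored for every group, which is why practical group sizes are much larger than four.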

Want to dive deeper?

Check out these resources to learn more about quantization, model compression, and efficient inference:

→AutoGPTQ: A widely used library implementing GPTQ quantization. Works with most transformers models.
→AutoAWQ: Fast and accurate AWQ quantization with optimized kernels.
→bitsandbytes: Library for 8-bit and 4-bit quantization, integrated with Hugging Face Transformers and commonly used for inference and QLoRA-style fine-tuning.
→Ollama: Run quantized models locally. Great for experimenting with different precisions.
