Interactive · ~15 min · Beginner

Embeddings — from words to vectors

Every language model assigns each word a coordinate in meaning-space. These embeddings are where semantic structure lives. Distance between vectors measures similarity, arithmetic on vectors encodes relationships, and clusters emerge from predicting context. Learn how 10,000-dimensional one-hot vectors compress into 768-dimensional dense spaces that pack in meaning.

Why words need numbers

Computers work with numbers, not words. When you feed text into a language model, it needs to convert each word into a numerical representation. The obvious approach is one-hot encoding: out of 10,000 words, mark exactly one position as 1 and the rest as 0.

This works, but it's wasteful. One-hot vectors are huge (10,000 dimensions), sparse (99.99% zeros), and worst of all: orthogonal. From a one-hot vector's perspective, "cat" and "dog" are as different as "cat" and "pizza." There's no signal that these are semantically similar.
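To make the orthogonality problem concrete, here is a minimal sketch in plain Python. The word IDs are hypothetical (a real vocabulary assigns them arbitrarily), and `one_hot` and `dot` are small helpers written for this example.

```python
def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


VOCAB_SIZE = 10_000
vocab = {"cat": 0, "dog": 1, "pizza": 2}  # hypothetical word IDs

cat = one_hot(vocab["cat"], VOCAB_SIZE)
dog = one_hot(vocab["dog"], VOCAB_SIZE)
pizza = one_hot(vocab["pizza"], VOCAB_SIZE)

print(dot(cat, dog))    # 0: no overlap at all
print(dot(cat, pizza))  # 0: exactly as "different" as cat vs. dog
print(dot(cat, cat))    # 1: a word only matches itself
print(f"{(VOCAB_SIZE - 1) / VOCAB_SIZE:.2%} zeros")  # 99.99% zeros
```

Every pair of different words scores zero, so nothing in the representation says that "cat" is closer to "dog" than to "pizza."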

The better way

Dense embeddings map each word to a dense vector of real numbers (768 dimensions in BERT-base, up to 12,288 in GPT-3) where distance encodes meaning. Similar words sit close together, and arithmetic operations on vectors encode relationships. This is where much of a language model's word-level semantic knowledge is stored.
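One way to see how the two representations connect: multiplying a one-hot vector by an embedding table selects a single row, so in practice the model just looks the row up by index. The sketch below assumes tiny, hypothetical sizes and random values standing in for learned ones.

```python
import random

random.seed(0)
VOCAB_SIZE, EMBED_DIM = 5, 3  # tiny hypothetical sizes, for readability

# Embedding table: one row of EMBED_DIM floats per word.
# Random here; in a real model these values are learned.
table = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)]
         for _ in range(VOCAB_SIZE)]


def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]


def one_hot_times_table(one_hot_vec, matrix):
    """Row vector times matrix: column j is sum_i v[i] * matrix[i][j]."""
    dim = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(one_hot_vec, matrix))
            for j in range(dim)]


word_id = 2
via_matmul = one_hot_times_table(one_hot(word_id, VOCAB_SIZE), table)
via_lookup = table[word_id]

print(via_matmul == via_lookup)  # both routes give the same dense vector
```

The lookup does a fraction of the work of the multiplication, which is why embedding layers are usually implemented as indexed tables.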

One-Hot vs. Dense Embeddings

Sparse one-hot vectors carry no semantic information. Dense embeddings pack meaning into far fewer dimensions.

One-hot encoding for 'cat'

Only one dimension is 1, the rest are 0. The interactive view shows the first 100 of 10,000 dimensions.

[1, 0, 0, 0, 0, 0, ..., 0, 0, 0]
10,000 elements, only 1 is non-zero

Problems with one-hot:

  • No similarity info (cat and dog are orthogonal)
  • Huge memory (10k dimensions per word)
  • Models can't learn from other words' updates

Property          One-Hot         Dense
Dimensions        10,000+         768-2048
Sparsity          99.99% zeros    All non-zero
Similarity Info   None            Rich
Learnability      Isolated        Shared

Different models use different embedding sizes. GPT-3 uses 12,288 dimensions, which is huge, but it is a massive model. BERT-base and RoBERTa-base both use 768. Embedding size is a trade-off: more dimensions mean more capacity to encode meaning, but also more parameters to train and a larger memory footprint.
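The parameter cost of that trade-off is simple arithmetic: an embedding table has one row per vocabulary word and one column per dimension. The sketch below assumes the 10,000-word vocabulary used in this lesson's examples and 4-byte float32 storage; real vocabularies and storage formats vary.

```python
def embedding_params(vocab_size, dim):
    """Number of values in an embedding table: one row per word."""
    return vocab_size * dim


VOCAB_SIZE = 10_000      # the lesson's example vocabulary, not a real model's
BYTES_PER_FLOAT32 = 4    # assumption: weights stored as float32

for dim in (768, 12_288):
    n = embedding_params(VOCAB_SIZE, dim)
    megabytes = n * BYTES_PER_FLOAT32 / 1e6
    print(f"dim={dim}: {n:,} parameters, about {megabytes:.1f} MB")
```

Going from 768 to 12,288 dimensions multiplies the table size by 16, before counting any other layer in the model.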

Interestingly, you can often compress embeddings (for example by distillation or dimensionality reduction) down to 128–256 dimensions without losing much semantic information. The semantic structure is robust: it doesn't require thousands of dimensions.

Vocabulary size is a natural ceiling: with one dimension per word you are back to one-hot encoding and gain nothing further. In practice, 768–2048 dimensions capture most semantic structure for language.

Quick check

What's the main disadvantage of one-hot encoding?

Similarity as distance

In embedding space, meaning is distance. Pick any two words—the closer their vectors, the more semantically related they are. But how do we measure distance? There are three common metrics, each with different properties. Try the calculator below to explore.

Similarity Explorer

Compare two words and see how the embedding space measures similarity.

Vector A: cat = (-0.80, 0.60)
Vector B: dog = (-0.75, 0.70)

Cosine similarity: 0.994 (very similar)
Euclidean distance: 0.112 (close)
Dot product: 1.020 (raw inner product)
Angle between vectors: 6.2° (the angular separation that cosine similarity is based on)

Why different metrics?

  • Cosine similarity: Measures angle between vectors (ignores magnitude). Best for meaning.
  • Euclidean distance: Straight-line distance in embedding space. Sensitive to magnitude.
  • Dot product: Sum of element-wise products, so it grows with both alignment and magnitude. Used in attention mechanisms and classifiers.

Cosine similarity measures the angle between vectors, ignoring their magnitude. Euclidean distance measures straight-line distance, sensitive to magnitude. For embeddings, cosine is usually better because it captures direction (meaning) independent of scale.
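The numbers in the Similarity Explorer can be reproduced by hand. This sketch uses the 2D "cat" and "dog" vectors from the explorer; the helper function names are chosen for this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    return dot(a, b) / (norm(a) * norm(b))


def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


cat = (-0.80, 0.60)
dog = (-0.75, 0.70)

cos = cosine_similarity(cat, dog)
angle = math.degrees(math.acos(cos))

print(f"cosine similarity  {cos:.3f}")                           # 0.994
print(f"euclidean distance {euclidean_distance(cat, dog):.3f}")  # 0.112
print(f"dot product        {dot(cat, dog):.3f}")                 # 1.020
print(f"angle              {angle:.1f} degrees")                 # 6.2
```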

Imagine two word vectors that point in nearly the same direction but have very different lengths. Euclidean distance reports them as far apart, while cosine similarity reports them as similar because the angle between them is small. For semantic meaning, the angle usually matters more than the magnitude.
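A quick numeric check of that scenario, using two made-up vectors that point in exactly the same direction but differ tenfold in length:

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


short_vec = (1.0, 2.0)
long_vec = (10.0, 20.0)  # same direction, 10x the length (illustrative values)

cos = cosine_similarity(short_vec, long_vec)
dist = math.dist(short_vec, long_vec)

print(f"cosine similarity:  {cos:.3f}")   # 1.000, identical direction
print(f"euclidean distance: {dist:.2f}")  # 20.12, far apart
```

The two metrics disagree completely here, which is why the choice of metric matters when embedding lengths vary.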

That said, the dot product (the raw inner product, with no normalization) matters because attention mechanisms use it directly. Transformers score each pair of positions with a dot product between learned query and key vectors, then turn those scores into attention weights, so vectors that align with the query get higher weights.
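The role of the dot product in attention can be sketched in a few lines. This is a simplified illustration, not a full attention layer: the 2D query and key vectors are hypothetical, and in a real transformer they come from learned projections of the token representations.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vectors standing in for one query and three keys.
query = (-0.80, 0.60)
keys = {
    "dog": (-0.75, 0.70),
    "pet": (-0.60, 0.50),
    "pizza": (0.70, -0.40),
}

# Scaled dot-product scoring: divide by sqrt(dimension) before softmax.
scale = math.sqrt(len(query))
scores = [dot(query, key) / scale for key in keys.values()]
weights = softmax(scores)

for word, weight in zip(keys, weights):
    print(f"{word:>5}: {weight:.3f}")
```

Keys that point the same way as the query ("dog", "pet") receive most of the weight, while the key pointing away ("pizza") receives the least.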

Quick check

What does cosine similarity measure?

The semantic space: clusters emerge

When you train a language model to predict words from context, embeddings self-organize into semantic clusters. Animals tend to cluster together, as do countries and emotions. The model is never explicitly told to create clusters: they emerge from the prediction task.
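To show what "predicting words from context" looks like as training data, here is a minimal sketch of how a Word2Vec-style skip-gram setup turns a sentence into (center word, context word) pairs. The function name and the window size are choices made for this example.

```python
def skipgram_pairs(tokens, window=2):
    """List (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)

for center, context in pairs:
    print(center, "->", context)
```

Words that keep showing up with the same context words are pushed toward similar vectors during training, which is where the clusters come from.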

The visualization below shows a 2D projection of real embedding space. Click words to find nearest neighbors. Try exploring different semantic domains.

Semantic Space Visualization

Words positioned by meaning. Similar words are close together. Click a word to see neighbors.

← negative | positive →

Click a word to explore its nearest neighbors
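Finding nearest neighbors is a sort by distance. The sketch below uses hypothetical 2D coordinates hand-picked for illustration; real embeddings have hundreds of dimensions, and large vocabularies usually rely on approximate search rather than a full sort.

```python
import math

# Hypothetical 2D coordinates, not taken from a trained model.
embeddings = {
    "cat": (-0.80, 0.60),
    "dog": (-0.75, 0.70),
    "horse": (-0.60, 0.75),
    "france": (0.45, 0.10),
    "japan": (0.55, 0.45),
    "pizza": (0.70, -0.40),
}


def nearest_neighbors(word, k=2):
    """Return the k words closest to `word` by Euclidean distance."""
    target = embeddings[word]
    others = [(math.dist(target, vec), other)
              for other, vec in embeddings.items() if other != word]
    return [other for _, other in sorted(others)[:k]]


print(nearest_neighbors("cat"))    # ['dog', 'horse']
print(nearest_neighbors("japan"))  # ['france', 'pizza']
```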

Quick check

Why do semantic clusters form in embedding space?

Arithmetic with meaning

One of the most striking properties of embeddings: vector arithmetic makes semantic sense. The classic example: king − man + woman ≈ queen.

This works because embeddings encode relationships as directions. The difference between "king" and "man" captures the concept of "royalty." Apply that same direction to "woman," and you land near "queen."

Analogy Explorer

Explore word analogies using vector arithmetic: A:B :: C:D becomes D = C + (B - A)

Vector arithmetic

B - A = (-0.15, -0.10)
C = (0.70, 0.55)
D = C + (B - A) = (0.55, 0.45)

Top matches for D (distance to D):

1. japan (0.000)
2. india (0.042)
3. germany (0.141)

How does this work?

The difference vector (B - A) captures the relationship between A and B. When we apply this same relationship to C, we get a new point in embedding space that typically lands near the word that completes the analogy. Real word embeddings (like Word2Vec) can produce surprisingly accurate analogies for common relationships.
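The same computation can be written out in a few lines. The 2D coordinates below are hypothetical, chosen so that the arithmetic matches the Analogy Explorer numbers; the explorer does not say which words play A, B, and C, so paris, france, and tokyo are assumptions made for this example.

```python
import math

# Hypothetical 2D coordinates, picked to reproduce the explorer's numbers.
embeddings = {
    "paris": (0.60, 0.20),
    "france": (0.45, 0.10),
    "tokyo": (0.70, 0.55),
    "japan": (0.55, 0.45),
    "india": (0.58, 0.48),
    "germany": (0.45, 0.35),
}


def analogy(a, b, c, k=3):
    """Solve a : b :: c : ? by ranking words near D = C + (B - A)."""
    va, vb, vc = embeddings[a], embeddings[b], embeddings[c]
    target = tuple(ci + (bi - ai) for ai, bi, ci in zip(va, vb, vc))
    # Exclude the query words so they cannot be returned as the answer.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return sorted(candidates,
                  key=lambda w: math.dist(embeddings[w], target))[:k]


print(analogy("paris", "france", "tokyo"))  # ['japan', 'india', 'germany']
```

Excluding the three query words from the candidates is a common convention in analogy evaluation, since the result often lands closest to one of the inputs.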

Analogy arithmetic works in theory for any relationship. In practice, it depends on the relationship and the quality of the embeddings. Simple relationships like capital-country work well: france − paris + tokyo lands near japan. But more abstract relationships can be noisy.

The success rate depends on how clearly the relationship is encoded in the embedding space. Relationships that appear frequently in text (like capital-country pairs) are very clean. Rare or ambiguous relationships are noisier.

Modern large language models can handle more complex relationships because they build context-dependent representations on top of their embeddings. But the core principle remains: relationships are directions, and arithmetic exploits that geometry.

Quick check

Why does vector arithmetic work on embeddings?

Step-by-step walkthrough

Walk through the complete story of embeddings, from the problem with one-hot encoding to how Word2Vec learns embeddings from prediction tasks. Each step has code, explanation, and interactive visualizations.

Words are just IDs (one-hot vectors)

Classically, we represent words as one-hot vectors: 10,000 words means 10,000 dimensions with a single 1. Problems: huge storage, no similarity info, and the model can't leverage patterns from similar words.

Deep dive: the one-hot problem

One-hot encoding seems natural—just use an index to identify each word. But it has fundamental problems that dense embeddings solve elegantly.

Memory waste

10,000 words means 10,000 dimensions per word, so every dot product walks a huge vector to use a single non-zero entry.

No similarity

All one-hot vectors are orthogonal: the dot product of any two different words is always zero.

No transfer

Learning about "cat" tells you nothing about "dog." No information is shared between words.

How embeddings fix this

Compression: 768 dense dimensions instead of 10,000 sparse ones

Similarity structure: Close vectors indicate related meanings

Shared learning: Words that appear in similar contexts receive similar updates, so what the model learns about one improves the others

Key takeaway

Embeddings are the bridge between symbols (words) and meaning (vectors). They compress information into dense spaces where distance tracks similarity and arithmetic tracks relationships. This geometric structure, which emerges from nothing but prediction tasks, is where language models store much of their understanding of words. A modern LLM starts with an embedding lookup and then applies layer after layer of arithmetic to those vectors.
