Embeddings — from words to vectors
Every language model assigns each word a coordinate in meaning-space. These embeddings are where semantic structure lives. Distance between vectors measures similarity, arithmetic on vectors encodes relationships, and clusters emerge from predicting context. Learn how 10,000-dimensional one-hot vectors compress into 768-dimensional dense spaces that pack in meaning.
Why words need numbers
Computers work with numbers, not words. When you feed text into a language model, it needs to convert each word into a numerical representation. The obvious approach is one-hot encoding: out of 10,000 words, mark exactly one position as 1 and the rest as 0.
This works, but it's wasteful. One-hot vectors are huge (10,000 dimensions), sparse (99.99% zeros), and worst of all: orthogonal. From a one-hot vector's perspective, "cat" and "dog" are as different as "cat" and "pizza." There's no signal that these are semantically similar.
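The orthogonality problem is easy to check by hand. The sketch below uses a hypothetical five-word vocabulary (the names `vocab`, `index`, and `one_hot` are illustrative, not from any library) and shows that "cat" is exactly as far from "dog" as it is from "pizza".

```python
import numpy as np

# Hypothetical toy vocabulary; the article's example uses 10,000 words.
vocab = ["cat", "dog", "pizza", "king", "queen"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

cat, dog, pizza = one_hot("cat"), one_hot("dog"), one_hot("pizza")

# Distinct one-hot vectors never share a non-zero position,
# so every pairwise dot product is 0 and every distance is sqrt(2).
print(cat @ dog, cat @ pizza)                                   # 0.0 0.0
print(np.linalg.norm(cat - dog), np.linalg.norm(cat - pizza))   # both sqrt(2)
```

No choice of vocabulary order changes this: the encoding carries identity and nothing else.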
The better way
Dense embeddings map words to small, dense vectors (typically 768–12,288 dimensions) where distance encodes meaning. Similar words are close together, and arithmetic operations on vectors encode relationships. This is where language models store semantic understanding.
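Mechanically, a dense embedding is one row of a table. The following is a minimal sketch, assuming a randomly initialized table `E` as a stand-in for trained weights and a made-up `word_id`. It shows that multiplying a one-hot vector by the table is the same as reading out a single row.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, dim = 10_000, 768          # sizes taken from the article's example
# Random stand-in for a trained embedding table (one row per word).
E = rng.normal(size=(vocab_size, dim)).astype(np.float32)

word_id = 42                            # hypothetical index of some word

# Multiplying a one-hot vector by E selects exactly one row...
one_hot = np.zeros(vocab_size, dtype=np.float32)
one_hot[word_id] = 1.0
via_matmul = one_hot @ E

# ...so in practice the multiplication is replaced by a direct row lookup.
via_lookup = E[word_id]

print(via_lookup.shape)                       # (768,)
print(np.allclose(via_matmul, via_lookup))    # True
```

The lookup form does the same job without touching the other 9,999 rows, which is why embedding layers are usually implemented as indexing rather than matrix multiplication.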
One-Hot vs. Dense Embeddings
Sparse one-hot vectors have no semantic information. Dense embeddings pack meaning into far fewer dimensions.
One-hot encoding for 'cat'
Only one dimension is 1, the rest are 0. Showing first 100 of 10,000 dimensions.
Problems with one-hot:
- No similarity info (cat and dog are orthogonal)
- Huge memory (10k dimensions per word)
- Models can't learn from other words' updates
| Property | One-Hot | Dense |
|---|---|---|
| Dimensions | 10,000+ | 768-2048 |
| Sparsity | 99.99% zeros | All non-zero |
| Similarity Info | None | Rich |
| Learnability | Isolated | Shared |
Different models use different embedding sizes. GPT-3 uses 12,288 dimensions, which is huge, but it is a massive model. BERT-base and RoBERTa-base both use 768. Embedding size is a trade-off: more dimensions mean more capacity to encode meaning, but also more parameters to train and a larger memory footprint.
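To make the parameter cost concrete, here is a rough back-of-the-envelope calculation. The vocabulary size of 50,000 and the 4 bytes per value (float32) are assumptions chosen for illustration, and `embedding_table_size` is a hypothetical helper.

```python
def embedding_table_size(vocab_size: int, dim: int, bytes_per_value: int = 4):
    """Return (parameter count, size in MiB) for a vocab_size x dim table."""
    params = vocab_size * dim
    mib = params * bytes_per_value / 2**20
    return params, mib

# vocab_size=50_000 is an assumed, illustrative value.
for dim in (768, 12_288):
    params, mib = embedding_table_size(50_000, dim)
    print(f"dim={dim:>6}: {params:,} parameters, {mib:,.0f} MiB")
```

The table grows linearly with the embedding size, so going from 768 to 12,288 dimensions multiplies its footprint by 16.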
Interestingly, you can often compress embeddings (for example with dimensionality reduction such as PCA, or by distilling into a smaller model) down to 128–256 dimensions without losing much semantic information. The semantic structure is robust: it doesn't require thousands of dimensions.
A natural upper bound is the vocabulary size: with one dimension per word, every word can already have its own axis, so extra dimensions add nothing. In practice, 768–2048 dimensions capture most semantic structure for language.
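One simple way to experiment with compressing embeddings is PCA, sketched below with plain NumPy. This swaps in PCA for illustration and is not necessarily the method the article has in mind. The data is synthetic and built to be low-rank on purpose, so it shows only the mechanics; real embeddings will not compress this cleanly. `pca_reduce` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for trained embeddings: 500 vectors in 256 dimensions
# that secretly depend on only 16 latent factors, plus a little noise.
n_words, dim, latent = 500, 256, 16
factors = rng.normal(size=(n_words, latent))
mixing = rng.normal(size=(latent, dim))
X = factors @ mixing + 0.01 * rng.normal(size=(n_words, dim))

def pca_reduce(X: np.ndarray, k: int):
    """Project rows of X onto their top-k principal components."""
    Xc = X - X.mean(axis=0)                      # center each dimension
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S[:k] ** 2).sum() / (S ** 2).sum()
    return Xc @ Vt[:k].T, explained

Z, explained = pca_reduce(X, k=16)
print(Z.shape)                                   # (500, 16)
print(f"variance kept: {explained:.3f}")
```

On real embeddings, the fraction of variance kept at a given `k` is a quick first check of how much structure survives the compression.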
Quick check
What's the main disadvantage of one-hot encoding?
Similarity as distance
In embedding space, meaning is distance. Pick any two words—the closer their vectors, the more semantically related they are. But how do we measure distance? There are three common metrics, each with different properties. Try the calculator below to explore.
Similarity Explorer
Compare two words and see how the embedding space measures similarity.
Example: cat vs. dog
- Cosine similarity: 0.994 (very similar)
- Euclidean distance: 0.112 (close)
- Dot product: 1.020 (raw inner product)
- Angle between vectors: 6.2°
Why different metrics?
- Cosine similarity: Measures angle between vectors (ignores magnitude). Best for meaning.
- Euclidean distance: Straight-line distance in embedding space. Sensitive to magnitude.
- Dot product: Sum of element-wise products; grows with both alignment and magnitude. Used in attention mechanisms and classifiers.
Cosine similarity measures the angle between vectors, ignoring their magnitude. Euclidean distance measures straight-line distance, sensitive to magnitude. For embeddings, cosine is usually better because it captures direction (meaning) independent of scale.
Imagine two word vectors that point in nearly the same direction but have very different lengths. Euclidean distance would say they're dissimilar (large distance), but cosine similarity says they're similar (small angle). For semantic meaning, the angle matters more than the magnitude.
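The three metrics are a few lines each. The sketch below uses made-up 3-dimensional vectors (not from a trained model) where `b` points in exactly the same direction as `a` but is ten times longer, which is the situation described above.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def dot_product(a, b):
    return float(a @ b)

# Toy 3-d vectors (made up for illustration, not from a trained model).
a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                      # same direction, ten times longer

print(cosine_similarity(a, b))    # ~1.0: identical direction
print(euclidean_distance(a, b))   # large: 9 * ||a||
print(dot_product(a, b))          # 140.0: grows with magnitude
```

Cosine similarity treats the two vectors as identical, while the other two metrics are dominated by the difference in length.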
That said, the dot product (the raw inner product, with no normalization) matters too, because attention mechanisms use it directly. Transformers take dot products between query and key vectors, which are learned projections of the token representations, to decide how much weight each token gets. Tokens whose projections align receive higher weights.
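As a rough illustration of how dot products turn into attention weights, here is a minimal single-query sketch of scaled dot-product attention scoring. The query and key vectors are made up by hand; in a real transformer they come from learned projection matrices, which this sketch leaves out.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())       # subtract max for numerical stability
    return e / e.sum()

# Toy query and key vectors (made up; real models produce these with
# learned projection matrices applied to token representations).
query = np.array([1.0, 0.0, 1.0, 0.0])
keys = np.array([
    [1.0, 0.0, 1.0, 0.0],   # points the same way as the query
    [0.0, 1.0, 0.0, 1.0],   # orthogonal to the query
    [0.5, 0.5, 0.5, 0.5],   # partially aligned
])

d_k = query.shape[0]
scores = keys @ query / np.sqrt(d_k)   # scaled dot products
weights = softmax(scores)              # attention weights, sum to 1

print(weights.round(3))
```

The key aligned with the query gets the largest weight, the orthogonal key the smallest, and the partially aligned key lands in between.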
Quick check
What does cosine similarity measure?
The semantic space: clusters emerge
When you train a language model to predict words from context, embeddings self-organize into semantic clusters. Animals tend to cluster together, as do countries and emotions. The model was never explicitly told to create clusters; they emerged from the prediction task.
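One common way to set up such a prediction task is the skip-gram formulation used by Word2Vec: every word is paired with the words around it, and the model learns to predict one from the other. The sketch below only generates the training pairs; `skipgram_pairs` and the sample sentence are illustrative, and the training loop itself is left out.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

sentence = "the cat sat on the mat".split()
pairs = list(skipgram_pairs(sentence, window=1))
print(pairs[:4])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

Words that appear in similar contexts end up in similar pairs, so they receive similar gradient updates. That is the pressure that pulls their vectors together.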
The visualization below shows a 2D projection of real embedding space. Click words to find nearest neighbors. Try exploring different semantic domains.
Semantic Space Visualization
Words positioned by meaning. Similar words are close together. Click a word to see neighbors.
Click a word to explore its nearest neighbors
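A nearest-neighbor lookup of this kind can be sketched as a ranking by cosine similarity. The 2-dimensional vectors below are invented by hand so that animals and countries form two groups; they are not from a trained model, and `nearest_neighbors` is a hypothetical helper.

```python
import numpy as np

# Hand-made 2-d toy embeddings (illustrative only, not from a trained model).
embeddings = {
    "cat":    np.array([0.90, 0.10]),
    "dog":    np.array([0.85, 0.20]),
    "horse":  np.array([0.80, 0.30]),
    "france": np.array([0.10, 0.90]),
    "japan":  np.array([0.20, 0.85]),
}

def nearest_neighbors(word, k=2):
    """Return the k words whose vectors have the highest cosine similarity to `word`."""
    words = [w for w in embeddings if w != word]
    M = np.stack([embeddings[w] for w in words])
    M = M / np.linalg.norm(M, axis=1, keepdims=True)   # unit-normalize rows
    q = embeddings[word] / np.linalg.norm(embeddings[word])
    sims = M @ q
    order = np.argsort(-sims)[:k]
    return [(words[i], float(sims[i])) for i in order]

print([w for w, _ in nearest_neighbors("cat")])      # ['dog', 'horse']
print([w for w, _ in nearest_neighbors("france")])   # ['japan', 'horse']
```

With a real vocabulary the brute-force ranking becomes slow, and approximate nearest-neighbor indexes are typically used instead.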
Quick check
Why do semantic clusters form in embedding space?
Arithmetic with meaning
One of the most striking properties of embeddings: vector arithmetic makes semantic sense. The classic example: king − man + woman ≈ queen.
This works because embeddings encode relationships as directions. The difference between "king" and "man" captures the concept of "royalty." Apply that same direction to "woman," and you land near "queen."
Analogy Explorer
Explore word analogies using vector arithmetic: A:B :: C:D becomes D = C + (B - A)
How does this work?
The difference vector (B - A) captures the relationship between A and B. When we apply this same relationship to C, we get a new point in embedding space that typically lands near the word that completes the analogy. Real word embeddings (like Word2Vec) produce remarkably accurate analogies!
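The formula D = C + (B - A) can be tried on a toy example. The 3-dimensional vectors below are constructed by hand so that one axis means "royalty" and another "gender"; that makes the analogy work by design, which trained embeddings only approximate. `analogy` is a hypothetical helper.

```python
import numpy as np

# Hand-built toy embeddings with axes [royalty, gender, food].
# Constructed so the analogy holds exactly; real embeddings are noisier.
embeddings = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "pizza": np.array([0.0,  0.0, 1.0]),
}

def analogy(a, b, c, k=1):
    """Solve a : b :: c : ? by ranking words near c + (b - a) by cosine similarity."""
    target = embeddings[c] + (embeddings[b] - embeddings[a])
    target = target / np.linalg.norm(target)
    scores = {}
    for word, vec in embeddings.items():
        if word in (a, b, c):          # the input words are conventionally excluded
            continue
        scores[word] = float(vec @ target / np.linalg.norm(vec))
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(analogy("man", "king", "woman"))   # ['queen']
```

Excluding the three input words from the candidates matters in practice, because the nearest vector to the computed point is often one of the inputs themselves.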
This does not work equally well for every relationship. In practice, it depends on the relationship and the quality of the embeddings. Simple relationships like city-country work well: france − paris + tokyo lands near japan. More abstract relationships can be noisy.
The success rate depends on how clearly the relationship is encoded in the embedding space. Relationships that appear frequently in text (like city-country pairs) are very clean. Rare or ambiguous relationships are noisier.
Modern large language models can handle more complex reasoning, in part because they have much larger embedding spaces and refine each vector using its context. But the core principle remains: relationships are directions, and arithmetic exploits that geometry.
Quick check
Why does vector arithmetic work on embeddings?
Step-by-step walkthrough
Walk through the complete story of embeddings, from the problem with one-hot encoding to how Word2Vec learns embeddings from prediction tasks. Each step has code, explanation, and interactive visualizations.
Words are just IDs (one-hot vectors)
Classically, we represent words as one-hot vectors: 10,000 words means 10,000 dimensions with a single 1. Problems: huge storage, no similarity info, and the model can't leverage patterns from similar words.
Deep dive: the one-hot problem
One-hot encoding seems natural—just use an index to identify each word. But it has fundamental problems that dense embeddings solve elegantly.
Memory waste
10,000 words = 10,000 dimensions per word. Every dot product walks through a huge vector to read a single non-zero entry.
No similarity
All vectors are orthogonal. The dot product of any two distinct words is always zero.
No transfer
Learning about "cat" tells you nothing about "dog." No sharing of information.
How embeddings fix this
✓ Compression: 768 dense dimensions instead of 10,000 sparse ones
✓ Similarity structure: Close vectors indicate related meanings
✓ Shared learning: Similar words sit close together, so what the model learns about one carries over to the others
Key takeaway
Embeddings are the bridge between symbols (words) and meaning (vectors). They compress information into dense spaces where distance = similarity and arithmetic = relationships. This geometric structure, which emerges from nothing but prediction tasks, is where language models store their understanding. To a first approximation, modern LLMs are giant embedding lookups followed by clever arithmetic on vectors.