Tokenization — how AI reads text
Every word you type gets split into subword tokens before any language model sees it. This lesson explains why models can't just use raw text or whole words, and how tokenization algorithms like BPE and WordPiece balance vocabulary size, sequence length, and coverage.
The tokenization problem
Language models don't read raw text. They read integers. To convert text into integers, we split it into tokens — atomic units like words, subwords, or characters. Each token gets mapped to a unique integer (a token ID).
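As a minimal sketch of the text-to-integer step, the toy code below assigns IDs in first-seen order to whitespace-split tokens. The helper names `build_vocab` and `encode` are hypothetical, and real tokenizers use a fixed, pre-trained vocabulary rather than one built from the input.

```python
def build_vocab(tokens):
    """Assign each distinct token the next unused integer ID, in first-seen order."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Map each token to its integer ID."""
    return [vocab[tok] for tok in tokens]


text = "the cat sat on the mat"
tokens = text.split()          # toy tokenization: split on whitespace
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
print(ids)  # [0, 1, 2, 3, 0, 4]
```

Note that both occurrences of "the" map to the same ID: the model sees only the integers.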
But what counts as a token? If we use whole words, we need a separate token for every unique word — millions of them. If we use characters, sequences become 5-10x longer, slowing down training. The answer: subword tokenization.
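To make the length trade-off concrete, this small sketch counts tokens for one arbitrarily chosen sentence under word-level and character-level splitting. The exact ratio depends on the text.

```python
sentence = "tokenization balances vocabulary size and sequence length"

word_tokens = sentence.split()   # word-level: one token per whitespace-separated word
char_tokens = list(sentence)     # character-level: one token per character, spaces included

print(len(word_tokens))                               # 7
print(len(char_tokens))                               # 57
print(round(len(char_tokens) / len(word_tokens), 1))  # 8.1
```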
Subword algorithms like BPE and WordPiece find a middle ground. They identify the most common character and subword pairs, merge them iteratively, and create a vocabulary tailored to your data. The result: meaningful units, reasonable sequence lengths, and broad language coverage.
Why this matters
Tokenization is the first step every language model takes. The quality of tokenization affects the model's ability to learn semantic patterns, handle typos and rare words, and work across languages. Choosing the right tokenization strategy can meaningfully improve model performance.
GPT-4 uses a vocabulary of roughly 100,000 tokens. On average, English text takes about 1.3 tokens per word (roughly 0.75 words per token): a common word like "hello" is typically 1 token, while a rarer word like "anthropomorphic" might be split into 3. For comparison, character-level encoding would take 4-10 tokens per word, and word-level would take 1 token per word but require millions of vocabulary entries.
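Code usually tokenizes less efficiently than prose: indentation, punctuation, and unusual identifiers mean each token covers fewer characters, so code needs more tokens than natural text of the same length. This is one reason some code-focused models train their tokenizers on code-heavy corpora, add dedicated tokens for runs of whitespace, or use larger vocabularies.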
Context limits (like "4K context") are counted in tokens, not words or characters, and the prompt and the output share the same budget. A prompt that says the same thing in fewer tokens leaves more space for reasoning and output. This is why prompt engineering often focuses on token efficiency.
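The budget arithmetic can be sketched as follows. Both constants are assumptions for illustration: 4096 is one common reading of "4K", and 1.3 tokens per word is only a rough rule of thumb for English. For real counts, run the model's own tokenizer.

```python
import math

CONTEXT_LIMIT = 4096     # assumed size of a "4K" context window, in tokens
TOKENS_PER_WORD = 1.3    # assumed rule of thumb for English; varies by tokenizer and text


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    return math.ceil(len(text.split()) * TOKENS_PER_WORD)


def remaining_budget(prompt: str, limit: int = CONTEXT_LIMIT) -> int:
    """Tokens left for the model's output after the prompt is counted."""
    return max(0, limit - estimate_tokens(prompt))


prompt = "Summarize the following report in three bullet points"
print(estimate_tokens(prompt), remaining_budget(prompt))
```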
Quick check
Why is character-level tokenization not practical for large language models?
BPE — byte pair encoding
Byte Pair Encoding is one of the two most common subword tokenization algorithms. Training BPE is simple: start with every word split into characters, count all adjacent symbol pairs in your corpus, merge the most frequent pair into a new symbol, count again, and repeat until you reach your target vocabulary size.
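The training loop described above can be sketched in a few lines. This is a toy version under stated assumptions: words are split on whitespace, ties between equally frequent pairs go to the pair seen first, and the function names (`get_pair_counts`, `merge_pair`, `train_bpe`) are hypothetical. Production implementations add byte-level handling and far more efficient counting.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, words):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to first seen
        words = merge_pair(best, words)
        merges.append(best)
    return merges


merges = train_bpe("low low low lower lowest", num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Each merge adds one entry to the vocabulary, so the number of merges directly controls the final vocabulary size.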
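Once trained, BPE gives you a fixed, ordered list of merge rules. To tokenize new text, split it into base symbols and apply the merges in the order they were learned. The procedure is deterministic, so the same text always produces the same tokens, even text that never appeared in the training corpus. BPE (in its byte-level variant) is used by GPT-2, GPT-3, and many modern language models.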
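Applying learned merges to unseen text might look like the sketch below. The merge list `MERGES` is made up for illustration, and `apply_bpe` is a hypothetical helper that handles one word at a time.

```python
def apply_bpe(word, merges):
    """Tokenize one word by applying merge rules in the order they were learned."""
    symbols = list(word)  # start from individual characters
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merge rules, listed in learned order.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]

print(apply_bpe("lower", MERGES))   # ['low', 'er']
print(apply_bpe("slower", MERGES))  # ['s', 'low', 'er']
```

The word "slower" need not appear in the training data: it is still covered by known pieces plus a single-character fallback.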
BPE works on any language you give it, but the merges it learns are specific to its training data. A BPE vocabulary trained only on English will fragment text in languages with different morphology (like Turkish with agglutination) or different writing conventions (like Chinese, which is written without spaces). Modern multilingual models either train on mixed corpora (learning a shared BPE vocabulary) or use tokenizers that do not assume whitespace-delimited words, such as SentencePiece.
English is lucky: spaces separate words clearly, and the alphabet is small. Chinese does not mark word boundaries with spaces and uses thousands of distinct characters, so a vocabulary of the same size covers it less efficiently. This is one reason why multilingual models often allocate more vocabulary to non-Latin scripts.
For truly multilingual coverage, SentencePiece is a common choice. It treats input as a raw stream of Unicode characters, whitespace included, so it does not rely on pre-tokenization like whitespace splitting. Note that SentencePiece is a toolkit rather than a competing algorithm: it can train either BPE or Unigram vocabularies.
Quick check
In BPE training, how is the 'most frequent pair' selected?
WordPiece and SentencePiece
WordPiece is similar to BPE but chooses merges by likelihood rather than raw frequency. Instead of merging the most frequent pair, WordPiece merges the pair that most increases the likelihood of the training corpus under its language model, which favors pairs that occur together more often than their individual frequencies would predict. It also marks subword continuations with a ## prefix, helping the model distinguish word-internal structure.
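The difference between the two selection rules can be illustrated with a commonly described approximation of the WordPiece score, freq(pair) / (freq(first) × freq(second)). Treat this formula as an assumption: the original WordPiece training code was not published, so this is a sketch of the idea rather than the reference implementation. The word counts are made up.

```python
from collections import Counter


def pair_statistics(words):
    """Return (pair frequencies, likelihood-style scores) for a toy corpus.

    `words` maps a tuple of symbols to that word's corpus frequency.
    Score = freq(pair) / (freq(first) * freq(second)), an assumed approximation.
    """
    unit_freq, pair_freq = Counter(), Counter()
    for symbols, freq in words.items():
        for s in symbols:
            unit_freq[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq
    scores = {
        (a, b): f / (unit_freq[a] * unit_freq[b]) for (a, b), f in pair_freq.items()
    }
    return pair_freq, scores


# Hypothetical corpus: each word pre-split into characters, continuations marked with ##.
WORDS = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}

pair_freq, scores = pair_statistics(WORDS)
best_by_frequency = max(pair_freq, key=pair_freq.get)  # what BPE would merge
best_by_score = max(scores, key=scores.get)            # what this WordPiece sketch would merge
print(best_by_frequency)  # ('##u', '##g')
print(best_by_score)      # ('##g', '##s')
```

Here "##u" is so common on its own that pairs containing it score low, while "##g ##s" wins because "##s" almost never appears without "##g" before it.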
The ## prefix tells the model: "this token is a continuation of the previous word." This is especially useful for BERT and other masked-language models that predict masked tokens.
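A minimal sketch of how a WordPiece-style tokenizer produces ## tokens at inference time, using greedy longest-match-first lookup as BERT-style tokenizers do. The vocabulary `VOCAB` and the function name `wordpiece_tokenize` are hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word; non-initial pieces get a ## prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark as a continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical vocabulary for illustration.
VOCAB = {"un", "happy", "##happy", "token", "##ization"}

print(wordpiece_tokenize("unhappy", VOCAB))       # ['un', '##happy']
print(wordpiece_tokenize("tokenization", VOCAB))  # ['token', '##ization']
print(wordpiece_tokenize("xyz", VOCAB))           # ['[UNK]']
print([t for w in "un happy".split() for t in wordpiece_tokenize(w, VOCAB)])  # ['un', 'happy']
```

The last two lines of output show the distinction the prefix preserves: "unhappy" yields `##happy`, while the two-word input "un happy" yields plain `happy`.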
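SentencePiece takes a different approach: it treats text as a raw stream of Unicode characters, with no pre-tokenization step, and encodes spaces as an ordinary symbol (▁). It is a toolkit that can train either BPE or Unigram language model vocabularies; with Unigram, it selects the most probable segmentation at inference time. Because it does not assume whitespace-delimited words, SentencePiece is largely language-agnostic: the same pipeline handles English, Chinese, Arabic, and code. Models such as the original LLaMA and Mistral 7B use SentencePiece.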
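Selecting the best segmentation under a Unigram model can be sketched as a small dynamic program (Viterbi search) over piece log-probabilities. This is an illustrative toy, not the SentencePiece library's API: `unigram_segment` is a hypothetical function and the log-probabilities in `LOG_PROBS` are invented. The ▁ character (U+2581) stands in for a space, following SentencePiece's convention.

```python
import math


def unigram_segment(text, log_probs):
    """Return the segmentation of `text` with the highest total log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in that best split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, i = [], n
    while i > 0:  # walk the back-pointers from the end of the text
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]


SPACE = "\u2581"  # the visible space marker
# Hypothetical piece log-probabilities.
LOG_PROBS = {
    SPACE: -2.0,
    SPACE + "un": -3.0,
    "un": -3.5,
    "happy": -4.0,
    SPACE + "unhappy": -9.0,
}

pieces = unigram_segment(SPACE + "unhappy", LOG_PROBS)
print(pieces)  # ['▁un', 'happy']
```

Three segmentations are possible here, scoring -9.0, -9.5, and -7.0. The search picks the two-piece split because its total log-probability is highest.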
For most purposes, BPE and WordPiece produce similar tokenizations. The likelihood-based scoring in WordPiece favors pieces that rarely occur apart, which can yield somewhat more coherent subwords, especially for morphologically rich languages, though the difference is usually small. BPE is simpler and more transparent (you can see exactly which pairs are being merged at each step).
The ## prefix convention in WordPiece matters most for token classification tasks (predicting a label for each token): without ##, the model cannot distinguish "unhappy" (one word) from "un happy" (two words). For pure language modeling (predicting the next token), the difference is smaller.
In practice, choose based on your model family: if you are using BERT, use WordPiece. If you are using GPT or building a new model, BPE is simpler. If you want maximum language coverage, use SentencePiece.
Quick check
What does the ## prefix in WordPiece tokenization mean?
See them side by side
The same text tokenizes differently under character-level, BPE, and WordPiece tokenization:
Character-level
Con: verbose, loses semantics
BPE
Compact. Con: fixed vocab
WordPiece
Marks continuations with a ## prefix. Con: complex scoring
Key differences: BPE uses frequency-based merging and has no special markers. WordPiece uses likelihood scoring and marks subword continuations with ##. Character tokenization is simple but verbose. All three handle unknown words differently: character-level always works as long as each character is in the vocabulary, character-based BPE can still hit out-of-vocabulary symbols (byte-level BPE, as in GPT-2, cannot), and WordPiece falls back to [UNK].
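One way to see why a byte-level base vocabulary never runs out of coverage: any string encodes to UTF-8 bytes, and there are only 256 possible byte values. The sketch below shows only that base layer (no merges), with hypothetical helper names.

```python
def to_byte_ids(text):
    """Encode text as UTF-8 and use each byte value (0-255) as a token ID."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids):
    """Invert to_byte_ids: rebuild the original string from byte values."""
    return bytes(ids).decode("utf-8")


print(to_byte_ids("hi"))       # [104, 105]
print(to_byte_ids("\u00e9"))   # [195, 169]  (the letter e with an acute accent takes two bytes)
print(from_byte_ids(to_byte_ids("na\u00efve")) == "na\u00efve")  # True
```

The cost is length: characters outside ASCII take two to four bytes each, which is why merges on top of the byte layer matter.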
Quick check
Looking at the comparison above, why does BPE typically use fewer tokens than character-level?
Guided walkthrough: all approaches explained
Step through the lesson below to build intuition about each tokenization strategy, starting with why simple approaches fail and building up to state-of-the-art subword algorithms.
Why not just use words?
Word-level tokenization (splitting on whitespace) seems simple and intuitive. But it breaks down quickly with typos, compound words, rare words, or languages without clear word boundaries like Chinese or Japanese. Plus, your model needs a separate token for every unique word in existence.
Examples
unhappy
If the exact word is missing from the vocabulary, it is treated as one OOV (unknown) token, even though "un" and "happy" have common meanings
ChatGPT
Proper nouns and neologisms become rare tokens, wasting vocabulary space
Pneumonoultramicroscopicsilicovolcanoconiosis
Long technical terms require their own rare token slots
Key insight: Word-level tokenization cannot share structure between related word forms (prefixes, compounds), cannot recover from typos, and requires an enormous vocabulary.
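The failure mode is easy to reproduce with a toy word-level encoder. The vocabulary `WORD_VOCAB` and the function `word_level_encode` below are hypothetical; the point is that any word not listed exactly collapses to the same unknown ID.

```python
def word_level_encode(text, vocab, unk_id=0):
    """Map each whitespace-separated word to its ID, or to unk_id if it is not listed."""
    return [vocab.get(word, unk_id) for word in text.split()]


# Hypothetical four-entry vocabulary; ID 0 is reserved for unknown words.
WORD_VOCAB = {"[UNK]": 0, "i": 1, "am": 2, "happy": 3}

print(word_level_encode("i am happy", WORD_VOCAB))    # [1, 2, 3]
print(word_level_encode("i am unhappy", WORD_VOCAB))  # [1, 2, 0]
print(word_level_encode("i am hapy", WORD_VOCAB))     # [1, 2, 0]
```

"unhappy" and the typo "hapy" become indistinguishable, and neither keeps any link to "happy".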
Key takeaway
Tokenization is how models convert text into integers. Character-level is universal but verbose, word-level is efficient but limited, and subword tokenization (BPE, WordPiece, SentencePiece) finds the middle ground. The algorithm you choose affects sequence length, vocabulary size, coverage, and ultimately model performance. Understanding tokenization helps you debug model behavior, optimize prompts, and design better training data.