Vamsi Krishna Sankarayogi — Technologist at Heart

The tokenization problem

Language models don't read raw text. They read integers. To convert text into integers, we split it into tokens — atomic units like words, subwords, or characters. Each token gets mapped to a unique integer (a token ID).

But what counts as a token? If we use whole words, we need a separate token for every unique word — millions of them. If we use characters, sequences become 5-10x longer, slowing down training. The answer: subword tokenization.

Subword algorithms like BPE and WordPiece find a middle ground. They identify the most common character and subword pairs, merge them iteratively, and create a vocabulary tailored to your data. The result: meaningful units, reasonable sequence lengths, and broad language coverage.

Why this matters

Tokenization is the first step every language model takes. The quality of tokenization affects the model's ability to learn semantic patterns, handle typos and rare words, and work across languages. Choosing the right tokenization strategy can meaningfully improve model performance.

Live tokenizer

Example text: The quick brown fox jumps over the lazy dog

Token count: 43
Unique tokens: 28
Compression: 0.0%

Token IDs: 1 9 6 0 18 22 10 4 12 0 3 19 16 24 15 0 7 16 25 0 11 22 14 17 20 0 16 23 6 19 0 21 9 6 0 13 2 27 26 0 5 16 8
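
The IDs listed above are consistent with a character-level scheme in which every distinct character gets an ID by sorted order (space first, then "T", then the lowercase letters). The sketch below reproduces them under that assumption. The function names are illustrative, and the compression formula (1 minus tokens per character) is a guess at what the tool reports, not something the page states.

```python
def build_char_vocab(text):
    """Assign each distinct character an ID by sorted (code point) order."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def encode(text, vocab):
    """Map every character of the text to its integer ID."""
    return [vocab[ch] for ch in text]


text = "The quick brown fox jumps over the lazy dog"
vocab = build_char_vocab(text)
ids = encode(text, vocab)

# Assumed definition: fraction of tokens saved relative to one token per character.
compression = 1 - len(ids) / len(text)

print(len(ids), len(vocab))   # 43 28
print(f"{compression:.1%}")   # 0.0%
print(ids[:9])                # [1, 9, 6, 0, 18, 22, 10, 4, 12]
```

With one token per character there is nothing to compress, which is why the figure is 0.0% here.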

GPT-4 uses a vocabulary of roughly 100,000 tokens. On average, English text takes about 1.3 tokens per word (roughly 0.75 words per token): a single word like "hello" might be 1 token, while "anthropomorphic" might be 3. For comparison, character-level encoding would take 4-10 tokens per word, and word-level would take 1 token per word but require millions of vocabulary entries.

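Code tends to tokenize less efficiently than prose: indentation, punctuation, and unusual identifiers break into many short tokens, so each token covers fewer characters and the same amount of text needs more tokens. This is one reason some models use tokenizers trained specifically on code, or larger vocabularies that include code-specific tokens such as runs of whitespace.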

Context limits (like "4K context") are counted in tokens, not characters or words. A prompt that says the same thing in fewer tokens leaves more of the window for reasoning and output, which is why prompt engineering often focuses on token efficiency.
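
The budget arithmetic can be sketched in a few lines. The prompt sizes below are hypothetical token counts chosen only to illustrate the trade-off, and `output_budget` is an illustrative helper, not part of any library.

```python
def output_budget(context_limit, prompt_tokens):
    """Tokens left for the model's reasoning and output."""
    if prompt_tokens > context_limit:
        raise ValueError("prompt does not fit in the context window")
    return context_limit - prompt_tokens


CONTEXT_LIMIT = 4096   # a "4K context" window

verbose_prompt = 3200  # hypothetical token count
concise_prompt = 1800  # hypothetical token count for the same request, tightened

print(output_budget(CONTEXT_LIMIT, verbose_prompt))  # 896
print(output_budget(CONTEXT_LIMIT, concise_prompt))  # 2296
```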

Quick check

Why is character-level tokenization not practical for large language models?

BPE — byte pair encoding

Byte Pair Encoding is one of the two most common subword tokenization algorithms. Training BPE is simple: count all adjacent character pairs in your corpus, merge the most frequent pair, count again, and repeat until you reach your target vocabulary size.

BPE merge step

Initial state (step 0 of 6): split into characters

Corpus: "hello world", "hello there", "world is great"

Current tokens: 36
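
The training loop described above (count adjacent pairs, merge the most frequent, repeat) can be sketched as follows. This is a minimal illustration, not a production tokenizer: it assumes whitespace pre-tokenization so that merges never cross word boundaries, breaks ties between equally frequent pairs alphabetically, and uses illustrative function names.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Return the ordered list of merge rules learned from the corpus."""
    # Assumption: split on whitespace first, then into characters.
    words = dict(Counter(tuple(w) for line in corpus for w in line.split()))
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        # Highest count wins; ties are broken by the pair itself (deterministic).
        best = max(counts, key=lambda p: (counts[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges


corpus = ["hello world", "hello there", "world is great"]
merges = train_bpe(corpus, num_merges=6)
print(merges[0])  # ('h', 'e'), the only pair that occurs three times
```

On this corpus the pair ("h", "e") appears twice in "hello" (which occurs twice) and once in "there", so it is merged first. Later merges depend on the tie-breaking rule, which real implementations choose differently.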

Once trained, BPE gives you a fixed list of merge rules. To tokenize new text, apply these merges in order. This guarantees consistent tokenization across any text, even text the model has never seen. BPE is used by GPT-2, GPT-3, and many modern language models.
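
Applying a trained merge list to new text might look like the sketch below. The merge rules here are hand-written for illustration rather than learned from a corpus, and `apply_merges` is an illustrative name.

```python
def apply_merges(word, merges):
    """Tokenize one word by replaying the merge rules in training order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merge rules, in the order they were learned.
merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]

print(apply_merges("hello", merges))  # ['hello']
print(apply_merges("shell", merges))  # ['s', 'hell']
print(apply_merges("help", merges))   # ['he', 'l', 'p']
```

Words that never appeared in training ("shell", "help") still tokenize deterministically: the rules that match are applied, and whatever is left stays as smaller pieces.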

BPE works on any language you give it, but it learns language-specific patterns. A model trained on English BPE will struggle with languages that have different morphology (like Turkish with agglutination, or Chinese with no spaces). Modern multilingual models either train on mixed corpora (learning a shared BPE) or use language-agnostic tokenizers like SentencePiece.

English is lucky: spaces separate words clearly, and the alphabet is small. Languages like Chinese have no word boundaries, so BPE must learn character-level patterns, which is less efficient. This is one reason why multilingual models often allocate more vocabulary to non-Latin scripts.

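For truly multilingual coverage, SentencePiece is often a better fit than BPE with whitespace pre-tokenization. It treats the input as a raw stream of Unicode characters, with spaces handled as ordinary symbols, so it does not depend on language-specific rules for finding word boundaries.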
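
A quick way to see why whitespace pre-tokenization is language-specific is to split an English and a Chinese sentence the same way. The sentences are illustrative; the Chinese one is intended to mean roughly "I like machine learning".

```python
english = "I like machine learning"
chinese = "我喜欢机器学习"

# Whitespace splitting finds word boundaries in English...
print(len(english.split()))  # 4

# ...but returns the whole Chinese sentence as a single "word".
print(len(chinese.split()))  # 1

# The sentence is 7 characters, and each takes 3 bytes in UTF-8.
print(len(chinese), len(chinese.encode("utf-8")))  # 7 21
```

A tokenizer that relies on the split step has nothing to work with in the second case, so it has to learn every boundary from character statistics alone.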

Quick check

In BPE training, how is the 'most frequent pair' selected?

WordPiece and SentencePiece

WordPiece modifies BPE by using likelihood scoring instead of raw frequency. Instead of merging the most frequent pair, WordPiece merges the pair that most increases the likelihood of the corpus under the language model. It also marks subword continuations with a ## prefix, helping the model distinguish word-internal structure.
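
One commonly described approximation of the likelihood criterion scores a pair as its count divided by the product of its parts' counts, so a pair wins when its parts rarely occur apart. The sketch below assumes that formula and uses made-up word counts; it is not taken from the original WordPiece implementation.

```python
from collections import Counter


def to_wordpiece_symbols(word):
    """Split a word into characters, marking non-initial ones with ##."""
    return tuple([word[0]] + ["##" + ch for ch in word[1:]])


def pair_statistics(word_freqs):
    """Return (symbol counts, pair counts) weighted by word frequency."""
    symbol_counts, pair_counts = Counter(), Counter()
    for word, freq in word_freqs.items():
        symbols = to_wordpiece_symbols(word)
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return symbol_counts, pair_counts


# Toy word counts, invented for illustration.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
symbol_counts, pair_counts = pair_statistics(word_freqs)

# Assumed scoring rule: count(ab) / (count(a) * count(b)).
scores = {
    (a, b): n / (symbol_counts[a] * symbol_counts[b])
    for (a, b), n in pair_counts.items()
}

most_frequent = max(pair_counts, key=pair_counts.get)
best_scored = max(scores, key=scores.get)

print(most_frequent)  # ('##u', '##g'): what frequency-based BPE would merge
print(best_scored)    # ('##g', '##s'): what the likelihood-style score prefers
```

The pair ("##u", "##g") is the most frequent, but "##u" is common everywhere, so its score is diluted. The pair ("##g", "##s") is rare, yet "##s" never appears without "##g" in front of it, so it scores highest.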

WordPiece example

unhappy → ["un", "##happy"]

running → ["run", "##ning"]

internship → ["inter", "##nship"] or ["in", "##tern", "##ship"]

The ## prefix tells the model: "this token is a continuation of the previous word." This is especially useful for BERT and other masked-language models that predict masked tokens.
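
At tokenization time, BERT-style WordPiece uses greedy longest-match-first lookup against the vocabulary. A minimal sketch, with a tiny hand-picked vocabulary that is purely illustrative:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical vocabulary for illustration.
vocab = {"un", "##happy", "run", "##ning", "##s"}

print(wordpiece_tokenize("unhappy", vocab))  # ['un', '##happy']
print(wordpiece_tokenize("running", vocab))  # ['run', '##ning']
print(wordpiece_tokenize("xyz", vocab))      # ['[UNK]']
```

Which segmentation a real model produces depends entirely on its learned vocabulary, which is why a word like "internship" can split in more than one plausible way.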

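SentencePiece takes a different approach: it treats text as a raw stream of Unicode characters, with no pre-tokenization step, and encodes spaces as an ordinary symbol. It is a library rather than a single algorithm, and it can train either a BPE model or a Unigram language model. With Unigram, it selects the most probable segmentation at inference time. Because it makes no assumptions about word boundaries, SentencePiece is largely language-agnostic: the same pipeline handles English, Chinese, Arabic, and code. Models such as LLaMA 2 and Mistral 7B use SentencePiece with its BPE model.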
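
Two ideas from this paragraph can be sketched without the library itself: escaping spaces as the visible symbol "▁" (U+2581), which is the SentencePiece convention, and picking the most probable segmentation under a Unigram model with a Viterbi search. The piece probabilities below are invented for illustration, and the functions are illustrative stand-ins, not the SentencePiece API.

```python
import math


def escape_whitespace(text):
    """Mark spaces with U+2581, adding one at the start of the text."""
    return "▁" + text.replace(" ", "▁")


def unescape_whitespace(pieces):
    """Invert escape_whitespace on a list of pieces."""
    return "".join(pieces).replace("▁", " ")[1:]


def viterbi_segment(text, piece_logp):
    """Return the segmentation with the highest total log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best score of any segmentation of text[:i]
    back = [0] * (n + 1)          # start index of the last piece in that segmentation
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in piece_logp and best[start] + piece_logp[piece] > best[end]:
                best[end] = best[start] + piece_logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Hypothetical piece probabilities (a trained model would estimate these).
probs = {"▁un": 0.2, "happy": 0.2, "▁": 0.1, "hap": 0.05, "py": 0.05,
         "u": 0.05, "n": 0.05, "h": 0.05, "a": 0.05, "p": 0.05, "y": 0.05}
piece_logp = {piece: math.log(p) for piece, p in probs.items()}

pieces = viterbi_segment(escape_whitespace("unhappy"), piece_logp)
print(pieces)                       # ['▁un', 'happy']
print(unescape_whitespace(pieces))  # unhappy
```

Because the space is part of the token stream, joining the pieces and reversing the escape recovers the original text exactly, with no language-specific detokenization rules.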

For most purposes, BPE and WordPiece produce similar tokenizations. The likelihood-based scoring in WordPiece tends to produce slightly more semantically coherent subwords, especially for morphologically rich languages. BPE is simpler and more transparent (you can see exactly which pairs are being merged at each step).

The ## prefix convention in WordPiece matters most for token classification tasks (predicting a label for each token): without ##, the model cannot distinguish "unhappy" (one word) from "un happy" (two words). For pure language modeling (predicting the next token), the difference is smaller.
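
The distinction is easy to see when tokens are grouped back into words, as a token classification pipeline has to do when it maps per-token labels onto words. A minimal sketch, with an illustrative helper name:

```python
def merge_wordpieces(tokens):
    """Group WordPiece tokens into words, attaching ## pieces to the previous word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return words


print(merge_wordpieces(["un", "##happy"]))  # ['unhappy']
print(merge_wordpieces(["un", "happy"]))    # ['un', 'happy']
```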

In practice, choose based on your model family: if you are using BERT, use WordPiece. If you are using GPT or building a new model, BPE is simpler. If you want maximum language coverage, use SentencePiece.

Quick check

What does the ## prefix in WordPiece tokenization mean?

See them side by side

Use the comparison tool below to see how Character-level, BPE, and WordPiece tokenize the same text differently. Try the presets or paste your own examples.

Compare tokenization approaches

Character-level

unhappiness

Tokens: 11
Vocab: 8
Pro: Universal, no OOV
Con: Verbose, loses semantics

BPE

unhappiness

Tokens: 11
Vocab: 8
Pro: Frequency-based, compact
Con: Fixed vocab

WordPiece

u ##[UNK] ##[UNK] ##[UNK] ##[UNK] ##[UNK] ##[UNK] ##[UNK] ##[UNK] ##[UNK] ##[UNK]

Tokens: 11
Vocab: 2
Pro: Likelihood-based, ## prefix marks continuations
Con: Complex scoring

Key differences: BPE uses frequency-based merging and has no special markers. WordPiece uses likelihood scoring and marks subword continuations with ##. Character tokenization is simple but verbose. All three handle unknown words differently: character-level can always fall back to individual characters, BPE can still hit out-of-vocabulary symbols unless it operates on bytes, and WordPiece falls back to [UNK].

Vocabulary explorer

Tokens (25 of 25), grouped as single characters, subwords, and full words.

Note: Vocabulary frequencies are from a sample corpus. Real language models train on billions of tokens, resulting in different frequency distributions.

Quick check

Looking at the comparison above, why does BPE typically use fewer tokens than character-level?

Guided walkthrough: all approaches explained

Step through the lesson below to build intuition about each tokenization strategy, starting with why simple approaches fail and building up to state-of-the-art subword algorithms.

Why not just use words?

Word-level tokenization (splitting on whitespace) seems simple and intuitive. But it breaks down quickly with typos, compound words, rare words, or languages without clear word boundaries like Chinese or Japanese. Plus, your model needs a separate token for every unique word in existence.

Examples

unhappy

Treated as one OOV (unknown) token, even though "un" and "happy" have common meanings

ChatGPT

Proper nouns and neologisms become rare tokens, wasting vocabulary space

Pneumonoultramicroscopicsilicovolcanoconiosis

Long technical terms require their own rare token slots

Key insight: Word-level tokenization fails on morphology, compound words, and typos, and it requires an enormous vocabulary.
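
The failure mode can be sketched with a toy word-level tokenizer. The training sentence, the `<unk>` token name, and the helper functions are all illustrative assumptions.

```python
def build_word_vocab(corpus):
    """Assign an ID to every whitespace-separated word; ID 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for word in corpus.split():
        vocab.setdefault(word, len(vocab))
    return vocab


def encode_words(text, vocab):
    """Map each word to its ID, or to <unk> if it was never seen."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]


vocab = build_word_vocab("the cat is happy")

print(encode_words("the cat is happy", vocab))    # [1, 2, 3, 4]
print(encode_words("the cat is unhappy", vocab))  # [1, 2, 3, 0]
print(encode_words("teh cat", vocab))             # [0, 2]
```

"unhappy" and the typo "teh" both collapse to the same unknown ID, so the model receives no signal that one is related to "happy" and the other to "the".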

Key takeaway

Tokenization is how models convert text into integers. Character-level is universal but verbose, word-level is efficient but limited, and subword tokenization (BPE, WordPiece, SentencePiece) finds the middle ground. The algorithm you choose affects sequence length, vocabulary size, coverage, and ultimately model performance. Understanding tokenization helps you debug model behavior, optimize prompts, and design better training data.
