Tokenizers Deep Dive
Train custom tokenizers and understand BPE, WordPiece, and Unigram algorithms
TL;DR
Tokenizers convert raw text into numerical token IDs that models can process. Learn the three main algorithms (BPE, WordPiece, Unigram), train a custom tokenizer from scratch using the tokenizers library, and understand padding, truncation, and special tokens.
What You'll Learn
- BPE, WordPiece, and Unigram tokenization algorithms
- Training custom tokenizers from scratch
- Fast vs slow tokenizers and performance differences
- Padding, truncation, and attention mask strategies
- Vocabulary analysis and token distribution
- Integrating custom tokenizers with transformers
Why Tokenization Is the Foundation
Every language model -- GPT-4, Llama, BERT -- sees text as a sequence of integer token IDs, not characters or words. Tokenization is the process that converts raw text into those IDs, and it directly affects model quality, speed, and cost. A poorly chosen vocabulary wastes context window on fragmented tokens (one word split into 5+ pieces), while a well-trained tokenizer keeps common phrases intact. Understanding tokenization is also essential for debugging: unexpected model behavior often traces back to how input was tokenized (e.g., numbers, code, non-English text).
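As a quick illustration -- a minimal sketch assuming `transformers` is installed and the `gpt2` tokenizer can be downloaded from the Hub -- common words survive as single tokens while rare words fragment into many pieces:

```python
"""Show token fragmentation: common vs rare words (gpt2 BPE)."""
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["the cat sat on the mat", "antidisestablishmentarianism"]:
    tokens = tokenizer.tokenize(text)
    # Frequent English words map to single tokens; rare words split
    # into several subword pieces, consuming more context window.
    print(f"{text!r}: {len(tokens)} tokens -> {tokens}")
```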
| Property | Value |
|---|---|
| Difficulty | Beginner |
| Time | ~3 hours |
| Lines of Code | ~250 |
| Prerequisites | Basic Python, familiarity with NLP concepts |
Tech Stack
| Component | Technology | Why |
|---|---|---|
| Tokenizer Training | tokenizers (Rust-backed) | 10-20x faster than pure Python implementations |
| Integration | transformers | Seamless use of custom tokenizers with any model |
| Analysis | matplotlib, collections | Vocabulary statistics and token distribution visualization |
| Python | 3.10+ | Type hint support |
Architecture
Tokenizer Training Pipeline
Inference: text → normalize → pre-tokenize → model → post-process
Example: "Hello world!" → ["hello", "world", "!"] → [101, 7592, 2088, 999, 102]
Project Structure
tokenizers-deep-dive/
├── src/
│ ├── __init__.py
│ ├── algorithms.py # BPE, WordPiece, Unigram implementations
│ ├── train_tokenizer.py # Train custom tokenizer from scratch
│ ├── analysis.py # Vocabulary analysis and statistics
│ └── integration.py # Use custom tokenizer with transformers
├── data/
│ └── corpus.txt # Training corpus
├── tokenizers/ # Saved tokenizer files
├── examples/
│ └── compare_algorithms.py
├── requirements.txt
└── README.md
Implementation
Step 1: Dependencies
tokenizers>=0.19.0
transformers>=4.40.0
matplotlib>=3.8.0
datasets>=2.19.0
Step 2: Understanding the Algorithms
"""Demonstrate tokenization algorithms with educational examples."""
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, normalizers
def explain_bpe():
"""
Byte-Pair Encoding (BPE):
Starts with character-level tokens and iteratively merges
the most frequent adjacent pair into a new token.
Used by: GPT-2, GPT-3, GPT-4, LLaMA, Mistral
"""
# Create a BPE tokenizer
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Normalization: lowercase, strip accents
tokenizer.normalizer = normalizers.Sequence([
normalizers.NFD(),
normalizers.Lowercase(),
normalizers.StripAccents(),
])
# Pre-tokenization: split on whitespace
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
# BPE trainer configuration
trainer = trainers.BpeTrainer(
vocab_size=1000,
min_frequency=2,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
show_progress=True,
)
return tokenizer, trainer
def explain_wordpiece():
"""
WordPiece:
Similar to BPE but uses likelihood maximization instead of
frequency. Merges the pair that maximizes the language model
likelihood of the training data.
The ## prefix indicates a subword continuation.
Used by: BERT, DistilBERT, Electra
"""
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([
normalizers.NFD(),
normalizers.Lowercase(),
normalizers.StripAccents(),
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
vocab_size=1000,
min_frequency=2,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
continuing_subword_prefix="##",
)
return tokenizer, trainer
def explain_unigram():
"""
Unigram (SentencePiece):
Starts with a large vocabulary and iteratively removes tokens
that least decrease the overall likelihood. Uses the EM algorithm
to find the optimal tokenization.
The ▁ prefix indicates a word boundary.
Used by: T5, ALBERT, XLNet, mBART
"""
tokenizer = Tokenizer(models.Unigram())
tokenizer.normalizer = normalizers.Sequence([
normalizers.Nmt(),
normalizers.NFKC(),
])
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
trainer = trainers.UnigramTrainer(
vocab_size=1000,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
unk_token="[UNK]",
)
    return tokenizer, trainer
How BPE Merging Works:
BPE Merge Process (simplified)
Step 0: Character level
Corpus: ["low", "lower", "newest", "widest"] → l o w l o w e r n e w e s t w i d e s t
Step 1: Merge (e, s)
Most frequent pair → merge into "es": l o w l o w e r n e w es t w i d es t
Step 2: Merge (es, t)
Most frequent pair → merge into "est": l o w l o w e r n e w est w i d est
Step 3: Merge (l, o)
Most frequent pair → merge into "lo": lo w lo w e r n e w est w i d est
Step 4: Merge (lo, w)
Most frequent pair → merge into "low": low low e r n e w est w i d est
Final vocabulary includes both characters and merged tokens.
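The merge loop is simple enough to sketch in pure Python. The toy below is educational only (real trainers are heavily optimized Rust); it uses the word frequencies from the original BPE paper (low×5, lower×2, newest×6, widest×3) so the merge order reproduces the steps above:

```python
"""Toy BPE merge loop -- educational sketch, not the real trainer."""
from collections import Counter

# Word frequencies from the original BPE paper (Sennrich et al., 2016),
# chosen so the merge order matches the walkthrough above.
corpus = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w", "e", "s", "t"): 6,
    ("w", "i", "d", "e", "s", "t"): 3,
}

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for step in range(1, 5):
    pair = most_frequent_pair(corpus)
    corpus = merge(corpus, pair)
    print(f"Step {step}: merge {pair} -> {''.join(pair)}")
# Step 1: merge ('e', 's') -> es
# Step 2: merge ('es', 't') -> est
# Step 3: merge ('l', 'o') -> lo
# Step 4: merge ('lo', 'w') -> low
```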
Algorithm Comparison:
| Algorithm | Merge Strategy | Subword Prefix | Used By |
|---|---|---|---|
| BPE | Most frequent pair | None (byte-level) | GPT-2/3/4, LLaMA, Mistral |
| WordPiece | Max likelihood pair | ## continuation | BERT, DistilBERT |
| Unigram | Remove least useful | ▁ word boundary | T5, ALBERT, XLNet |
Step 3: Train a Custom Tokenizer
"""Train a custom tokenizer from scratch."""
from tokenizers import (
Tokenizer,
models,
trainers,
pre_tokenizers,
normalizers,
processors,
decoders,
)
from pathlib import Path
def train_bpe_tokenizer(
corpus_files: list[str],
vocab_size: int = 30000,
min_frequency: int = 2,
output_dir: str = "tokenizers",
) -> Tokenizer:
"""
Train a BPE tokenizer from scratch.
Args:
corpus_files: List of text file paths for training
vocab_size: Target vocabulary size
min_frequency: Minimum token frequency to keep
output_dir: Directory to save the trained tokenizer
"""
# Initialize with BPE model
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
# Normalization pipeline
tokenizer.normalizer = normalizers.Sequence([
normalizers.NFD(), # Unicode decomposition
normalizers.StripAccents(), # Remove diacritics
normalizers.Lowercase(), # Lowercase all text
])
# Pre-tokenization: split on whitespace and punctuation
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(
add_prefix_space=False
)
# Decoder (reverses pre-tokenization for readable output)
tokenizer.decoder = decoders.ByteLevel()
# Define special tokens
special_tokens = ["<unk>", "<s>", "</s>", "<pad>", "<mask>"]
# Configure trainer
trainer = trainers.BpeTrainer(
vocab_size=vocab_size,
min_frequency=min_frequency,
special_tokens=special_tokens,
show_progress=True,
initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
# Train on corpus files
tokenizer.train(corpus_files, trainer)
# Post-processing: add special tokens around sequences
tokenizer.post_processor = processors.TemplateProcessing(
single="<s> $A </s>",
pair="<s> $A </s> $B:1 </s>:1",
special_tokens=[
("<s>", tokenizer.token_to_id("<s>")),
("</s>", tokenizer.token_to_id("</s>")),
],
)
# Save tokenizer
output_path = Path(output_dir)
output_path.mkdir(exist_ok=True)
tokenizer.save(str(output_path / "custom-bpe.json"))
print(f"Tokenizer trained with vocab size: {tokenizer.get_vocab_size()}")
return tokenizer
def train_from_datasets(
dataset_name: str = "wikitext",
dataset_config: str = "wikitext-2-raw-v1",
vocab_size: int = 30000,
) -> Tokenizer:
"""Train a tokenizer directly from a HuggingFace dataset."""
from datasets import load_dataset
dataset = load_dataset(dataset_name, dataset_config, split="train")
# Create an iterator over the text
def batch_iterator(batch_size=1000):
for i in range(0, len(dataset), batch_size):
yield dataset[i : i + batch_size]["text"]
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
vocab_size=vocab_size,
special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
# Train from iterator (memory-efficient for large datasets)
tokenizer.train_from_iterator(batch_iterator(), trainer)
return tokenizer
def demonstrate_encoding(tokenizer: Tokenizer, texts: list[str]):
"""Show encoding results for a list of texts."""
for text in texts:
encoding = tokenizer.encode(text)
print(f"\nText: {text}")
print(f" Tokens: {encoding.tokens}")
print(f" IDs: {encoding.ids}")
print(f" Length: {len(encoding.ids)}")Understanding the Training Code:
The train_bpe_tokenizer function assembles a full tokenizer from four composable stages. First, the normalizer chain applies NFD Unicode decomposition, strips accents, and lowercases -- this ensures that "Café" and "cafe" map to the same tokens. The ByteLevel pre-tokenizer then maps every byte to a printable Unicode character, so no input is ever out-of-vocabulary, even binary data. The BpeTrainer iteratively merges the most frequent pairs until the vocabulary reaches vocab_size. Finally, TemplateProcessing wraps every encoded sequence in <s> and </s>, the format RoBERTa-style models expect. The train_from_datasets variant shows a memory-efficient pattern: instead of writing text to files first, it feeds a Python iterator directly to train_from_iterator().
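To sanity-check the result, the sketch below reloads the saved file and encodes an accented string (it assumes train_bpe_tokenizer has already written tokenizers/custom-bpe.json; the exact subword splits depend on your training corpus):

```python
"""Reload the trained tokenizer and verify its pipeline end to end."""
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizers/custom-bpe.json")

encoding = tokenizer.encode("Café Crème")
# Normalization lowercases and strips accents; TemplateProcessing
# wraps the sequence: e.g. ['<s>', 'cafe', ..., '</s>'].
print(encoding.tokens)
print(encoding.ids)
print(tokenizer.decode(encoding.ids))  # special tokens skipped by default
```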
Tokenizer Training Pipeline: corpus files → normalizer → pre-tokenizer → BPE trainer (iterative merges) → post-processor → decoder → saved JSON file
Step 4: Padding and Truncation Strategies
"""Tokenizer analysis and padding/truncation strategies."""
from transformers import AutoTokenizer
from collections import Counter
def demonstrate_padding_strategies(model_name: str = "bert-base-uncased"):
"""Show different padding and truncation strategies."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
texts = [
"Short text.",
"This is a medium length sentence for testing.",
"This is a longer sentence that contains more tokens and will help "
"demonstrate how truncation works with different strategies.",
]
# Strategy 1: Pad to longest in batch
batch_longest = tokenizer(
texts,
padding="longest",
return_tensors="pt",
)
print("=== padding='longest' ===")
print(f" Shape: {batch_longest['input_ids'].shape}")
# Strategy 2: Pad to max model length
batch_max = tokenizer(
texts,
padding="max_length",
max_length=32,
truncation=True,
return_tensors="pt",
)
print(f"\n=== padding='max_length', max_length=32 ===")
print(f" Shape: {batch_max['input_ids'].shape}")
# Strategy 3: No padding (returns lists)
batch_none = tokenizer(texts, padding=False)
print(f"\n=== padding=False ===")
for i, ids in enumerate(batch_none["input_ids"]):
print(f" Text {i}: {len(ids)} tokens")
return batch_longest, batch_max, batch_none
def analyze_vocabulary(model_name: str = "bert-base-uncased"):
"""Analyze a tokenizer's vocabulary statistics."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
vocab = tokenizer.get_vocab()
# Basic stats
print(f"Model: {model_name}")
print(f"Vocab size: {len(vocab)}")
print(f"Max token length: {max(len(t) for t in vocab.keys())}")
# Token type distribution
subword_count = sum(1 for t in vocab if t.startswith("##"))
special_count = sum(1 for t in vocab if t.startswith("[") and t.endswith("]"))
regular_count = len(vocab) - subword_count - special_count
print(f"\nToken types:")
print(f" Regular words: {regular_count:>6,}")
print(f" Subwords (##): {subword_count:>6,}")
print(f" Special tokens: {special_count:>6,}")
# Character coverage
chars = set()
for token in vocab:
chars.update(token.replace("##", ""))
print(f" Unique chars: {len(chars):>6,}")
return vocab
def compare_tokenizations(text: str, model_names: list[str]):
"""Compare how different models tokenize the same text."""
print(f"Text: \"{text}\"\n")
for model_name in model_names:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
print(f" {model_name}:")
print(f" Tokens ({len(tokens)}): {tokens}")
print(f" IDs: {ids}")
print()
def token_frequency_analysis(
texts: list[str],
model_name: str = "bert-base-uncased",
) -> dict:
"""Analyze token frequency distribution in a corpus."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
token_counts = Counter()
total_tokens = 0
for text in texts:
tokens = tokenizer.tokenize(text)
token_counts.update(tokens)
total_tokens += len(tokens)
# Summary statistics
print(f"Total tokens: {total_tokens:,}")
print(f"Unique tokens: {len(token_counts):,}")
print(f"Avg tokens per text: {total_tokens / len(texts):.1f}")
print(f"\nTop 20 tokens:")
for token, count in token_counts.most_common(20):
pct = count / total_tokens * 100
print(f" {token:20s} {count:>8,} ({pct:.1f}%)")
    return dict(token_counts)
Padding and Truncation Explained:
Padding & Truncation Strategies:
| Strategy | Behavior | Typical Use |
|---|---|---|
| padding="longest" (recommended) | Pad each batch to its longest sequence | Most training and batch inference; least wasted compute |
| padding="max_length" | Pad every sequence to a fixed max_length (pair with truncation=True) | Static-shape backends such as ONNX export or TPUs |
| padding=False | No padding; returns variable-length lists | Preprocessing before a dynamic-padding collator |
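Whatever the padding strategy, the tokenizer also returns an attention_mask so the model can ignore pad positions. A minimal sketch (assumes transformers and torch are installed):

```python
"""Attention masks mark real tokens (1) vs padding (0)."""
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(
    ["Short text.", "A noticeably longer sentence for the same batch."],
    padding="longest",
    truncation=True,
    max_length=16,
    return_tensors="pt",
)
print(batch["input_ids"])       # padded IDs, shape (2, seq_len)
print(batch["attention_mask"])  # 1 for real tokens, 0 for [PAD] positions
```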
Step 5: Integration with Transformers
"""Integrate custom tokenizer with the transformers library."""
from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer
def wrap_for_transformers(
tokenizer_path: str,
model_max_length: int = 512,
) -> PreTrainedTokenizerFast:
"""
Wrap a tokenizers.Tokenizer as a transformers-compatible tokenizer.
This allows using a custom-trained tokenizer with any
transformers model, trainer, or pipeline.
"""
# Load the raw tokenizer
raw_tokenizer = Tokenizer.from_file(tokenizer_path)
# Wrap it for transformers compatibility
wrapped = PreTrainedTokenizerFast(
tokenizer_object=raw_tokenizer,
unk_token="<unk>",
bos_token="<s>",
eos_token="</s>",
pad_token="<pad>",
mask_token="<mask>",
model_max_length=model_max_length,
)
return wrapped
def compare_fast_vs_slow():
"""
Compare fast (Rust) vs slow (Python) tokenizers.
Fast tokenizers (PreTrainedTokenizerFast):
- Written in Rust via the `tokenizers` library
- 10-20x faster for batch encoding
- Support offset mapping (char-to-token alignment)
- Return BatchEncoding with additional methods
Slow tokenizers (PreTrainedTokenizer):
- Written in pure Python
- Easier to debug and modify
- Same API but fewer features
"""
from transformers import BertTokenizer, BertTokenizerFast
import time
texts = ["This is a test sentence."] * 1000
# Slow tokenizer
slow = BertTokenizer.from_pretrained("bert-base-uncased")
start = time.perf_counter()
slow(texts, padding=True, truncation=True)
slow_time = time.perf_counter() - start
# Fast tokenizer
fast = BertTokenizerFast.from_pretrained("bert-base-uncased")
start = time.perf_counter()
fast(texts, padding=True, truncation=True)
fast_time = time.perf_counter() - start
print(f"Slow tokenizer: {slow_time:.3f}s")
print(f"Fast tokenizer: {fast_time:.3f}s")
print(f"Speedup: {slow_time / fast_time:.1f}x")
def offset_mapping_demo():
"""
Demonstrate offset mappings for character-level alignment.
Offset mappings tell you which characters in the original text
correspond to each token — essential for NER and span extraction.
"""
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "HuggingFace is creating amazing NLP tools"
encoding = tokenizer(
text,
return_offsets_mapping=True,
)
print(f"Text: {text}\n")
print(f"{'Token':<15} {'ID':<8} {'Start':<6} {'End':<6} {'Original'}")
print("-" * 55)
for token, token_id, (start, end) in zip(
encoding.tokens(),
encoding["input_ids"],
encoding["offset_mapping"],
):
original = text[start:end] if start != end else "[special]"
print(f"{token:<15} {token_id:<8} {start:<6} {end:<6} {original}")Fast vs Slow Tokenizers:
| Feature | Fast (Rust) | Slow (Python) |
|---|---|---|
| Speed | 10-20x faster | Baseline |
| Offset mapping | Yes | No |
| Batch encoding | Parallelized | Sequential |
| Custom modification | Harder | Easier |
| Library | tokenizers | Pure Python |
| Default in transformers | Yes (since v4.0) | Legacy |
Step 6: Comparison Demo
"""Compare tokenization algorithms side by side."""
from src.analysis import compare_tokenizations
def main():
# Compare how different models tokenize the same text
compare_tokenizations(
text="The quick brown fox jumped over the lazy dog.",
model_names=[
"bert-base-uncased", # WordPiece
"gpt2", # BPE (byte-level)
"t5-small", # Unigram (SentencePiece)
],
)
# Interesting edge cases
print("=== Edge Cases ===\n")
compare_tokenizations(
text="transformers is 10x faster than tensorflow",
model_names=["bert-base-uncased", "gpt2"],
)
compare_tokenizations(
text="pneumonoultramicroscopicsilicovolcanoconiosis",
model_names=["bert-base-uncased", "gpt2"],
)
if __name__ == "__main__":
    main()
Running the Project
# Install dependencies
pip install -r requirements.txt
# Train a custom BPE tokenizer
python -c "
from src.train_tokenizer import train_from_datasets
tok = train_from_datasets(vocab_size=10000)
tok.save('tokenizers/custom-bpe.json')
"
# Compare algorithms
python examples/compare_algorithms.py
# Analyze BERT vocabulary
python -c "
from src.analysis import analyze_vocabulary
analyze_vocabulary('bert-base-uncased')
"
# Fast vs slow benchmark
python -c "
from src.integration import compare_fast_vs_slow
compare_fast_vs_slow()
"Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| BPE | Byte-Pair Encoding — merge most frequent pairs | Standard for modern LLMs (GPT, LLaMA) |
| WordPiece | Merge pairs that maximize likelihood | Used by BERT family; ## prefix for subwords |
| Unigram | Remove tokens that least reduce likelihood | Used by T5, ALBERT; ▁ prefix for word boundaries |
| Normalizer | Clean text before tokenization | Consistent handling of unicode, case, accents |
| Pre-tokenizer | Initial text splitting | Controls word boundary detection |
| Offset Mapping | Character-to-token alignment | Essential for NER, span extraction, highlighting |
| Fast Tokenizer | Rust-backed tokenizer | 10-20x faster than Python; supports offsets |
| Special Tokens | [CLS], [SEP], [PAD], [MASK] | Control model behavior for different tasks |
Next Steps
- Datasets Mastery — Load and process training data at scale
- Fine-Tuning with PEFT — Fine-tune models with your custom tokenizer