HuggingFace Ecosystem (Beginner)
Tokenizers Deep Dive
Train custom tokenizers and understand BPE, WordPiece, and Unigram algorithms
TL;DR
Tokenizers convert raw text into numerical token IDs that models can process. Learn the three main algorithms (BPE, WordPiece, Unigram), train a custom tokenizer from scratch using the tokenizers library, and understand padding, truncation, and special tokens.
Understand how tokenization works at every level, then train your own custom tokenizer from scratch using HuggingFace's Rust-backed tokenizers library.
What You'll Learn
- BPE, WordPiece, and Unigram tokenization algorithms
- Training custom tokenizers from scratch
- Fast vs slow tokenizers and performance differences
- Padding, truncation, and attention mask strategies
- Vocabulary analysis and token distribution
- Integrating custom tokenizers with transformers
Tech Stack
| Component | Technology |
|---|---|
| Tokenizer Training | tokenizers (Rust-backed) |
| Integration | transformers |
| Analysis | matplotlib, collections |
| Python | 3.10+ |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ TOKENIZER TRAINING PIPELINE │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Raw Text │──▶│ Normalizer │──▶│ Pre-tokenizer│──▶│ Model Training │ │
│ │ Corpus │ │ (lowercase, │ │ (whitespace, │ │ (BPE/WordPiece/ │ │
│ │ │ │ unicode) │ │ punctuation)│ │ Unigram) │ │
│ └──────────┘ └──────────────┘ └──────────────┘ └───────┬──────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ Trained Tokenizer │ │
│ │ ┌──────────────────────┐ │ │
│ │ │ Vocabulary (30K IDs) │ │ │
│ │ │ Merge rules │ │ │
│ │ │ Special tokens │ │ │
│ │ └──────────────────────┘ │ │
│ └──────────────────────────┘ │
│ │
│ INFERENCE: text ──► normalize ──► pre-tokenize ──► model ──► post-process │
│ "Hello world!" → ["hello", "world", "!"] → [101, 7592, 2088, 999, 102]
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Project Structure
tokenizers-deep-dive/
├── src/
│ ├── __init__.py
│ ├── algorithms.py # BPE, WordPiece, Unigram implementations
│ ├── train_tokenizer.py # Train custom tokenizer from scratch
│ ├── analysis.py # Vocabulary analysis and statistics
│ └── integration.py # Use custom tokenizer with transformers
├── data/
│ └── corpus.txt # Training corpus
├── tokenizers/ # Saved tokenizer files
├── examples/
│ └── compare_algorithms.py
├── requirements.txt
└── README.md
Implementation
Step 1: Dependencies
tokenizers>=0.19.0
transformers>=4.40.0
matplotlib>=3.8.0
datasets>=2.19.0
Step 2: Understanding the Algorithms
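Before the library code, here is the core idea in plain Python: a toy whole-word tokenizer that maps text to integer IDs. The four-entry vocabulary is made up for illustration; real tokenizers replace the dictionary lookup with the subword algorithms below.

```python
# Toy tokenizer: whitespace split + vocabulary lookup.
# The vocabulary here is invented for illustration only.
vocab = {"[UNK]": 0, "hello": 1, "world": 2, "!": 3}

def encode(text: str) -> list[int]:
    # Words missing from the vocabulary fall back to the [UNK] id
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

print(encode("Hello world"))  # [1, 2]
print(encode("Hello there"))  # [1, 0] ("there" is out of vocabulary)
```

The whole point of subword tokenization is to shrink that `[UNK]` fallback to almost nothing while keeping the vocabulary small.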
"""Demonstrate tokenization algorithms with educational examples."""
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, normalizers
def explain_bpe():
"""
Byte-Pair Encoding (BPE):
Starts with character-level tokens and iteratively merges
the most frequent adjacent pair into a new token.
Used by: GPT-2, GPT-3, GPT-4, LLaMA, Mistral
"""
# Create a BPE tokenizer
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Normalization: lowercase, strip accents
tokenizer.normalizer = normalizers.Sequence([
normalizers.NFD(),
normalizers.Lowercase(),
normalizers.StripAccents(),
])
# Pre-tokenization: split on whitespace
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
# BPE trainer configuration
trainer = trainers.BpeTrainer(
vocab_size=1000,
min_frequency=2,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
show_progress=True,
)
return tokenizer, trainer
def explain_wordpiece():
"""
WordPiece:
Similar to BPE but uses likelihood maximization instead of
frequency. Merges the pair that maximizes the language model
likelihood of the training data.
The ## prefix indicates a subword continuation.
Used by: BERT, DistilBERT, Electra
"""
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([
normalizers.NFD(),
normalizers.Lowercase(),
normalizers.StripAccents(),
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
vocab_size=1000,
min_frequency=2,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
continuing_subword_prefix="##",
)
return tokenizer, trainer
def explain_unigram():
"""
Unigram (SentencePiece):
Starts with a large vocabulary and iteratively removes tokens
that least decrease the overall likelihood. Uses the EM algorithm
to find the optimal tokenization.
The ▁ prefix indicates a word boundary.
Used by: T5, ALBERT, XLNet, mBART
"""
tokenizer = Tokenizer(models.Unigram())
tokenizer.normalizer = normalizers.Sequence([
normalizers.Nmt(),
normalizers.NFKC(),
])
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
trainer = trainers.UnigramTrainer(
vocab_size=1000,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
unk_token="[UNK]",
)
return tokenizer, trainerHow BPE Merging Works:
┌─────────────────────────────────────────────────────────────────┐
│ BPE MERGE PROCESS (simplified) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Corpus: ["low", "lower", "newest", "widest"] │
│ │
│ Step 0 — Character level: │
│ l o w l o w e r n e w e s t w i d e s t │
│ │
│ Step 1 — Most frequent pair: (e, s) → merge into "es" │
│ l o w l o w e r n e w es t w i d es t │
│ │
│ Step 2 — Most frequent pair: (es, t) → merge into "est" │
│ l o w l o w e r n e w est w i d est │
│ │
│ Step 3 — Most frequent pair: (l, o) → merge into "lo" │
│ lo w lo w e r n e w est w i d est │
│ │
│ Step 4 — Most frequent pair: (lo, w) → merge into "low" │
│ low low e r n e w est w i d est │
│ │
│ ...continue until vocab_size reached │
│ │
│ Final vocabulary includes both characters and merged tokens. │
│ │
└─────────────────────────────────────────────────────────────────┘
Algorithm Comparison:
| Algorithm | Merge Strategy | Subword Prefix | Used By |
|---|---|---|---|
| BPE | Most frequent pair | None (byte-level BPE marks spaces with Ġ) | GPT-2/3/4, LLaMA, Mistral |
| WordPiece | Max likelihood pair | ## continuation | BERT, DistilBERT |
| Unigram | Remove least useful | ▁ word boundary | T5, ALBERT, XLNet |
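The merge loop in the diagram above can be reproduced in a few lines of plain Python. This is an educational sketch, not the Rust implementation; ties between equally frequent pairs are broken by first occurrence, so the merge order may differ from the diagram.

```python
from collections import Counter

def bpe_train(words: list[str], num_merges: int):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    tokens = [list(w) for w in words]  # start at character level
    merges = []
    for _ in range(num_merges):
        # Count all adjacent pairs across the corpus
        pairs = Counter()
        for word in tokens:
            pairs.update(zip(word, word[1:]))
        if not pairs:
            break
        # WordPiece would instead pick max of count(ab) / (count(a) * count(b))
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the merge to every word
        new_tokens = []
        for word in tokens:
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_tokens.append(out)
        tokens = new_tokens
    return merges, tokens

merges, tokens = bpe_train(["low", "lower", "newest", "widest"], 4)
print(merges)  # learned merge rules, including ('e', 's') and ('es', 't')
print(tokens)  # "newest" ends up as ['n', 'e', 'w', 'est']
```

After four merges the corpus already contains whole-word tokens ("low") alongside subwords ("est") and leftover characters, which is exactly the mixed vocabulary the diagram describes.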
Step 3: Train a Custom Tokenizer
"""Train a custom tokenizer from scratch."""
from tokenizers import (
Tokenizer,
models,
trainers,
pre_tokenizers,
normalizers,
processors,
decoders,
)
from pathlib import Path
def train_bpe_tokenizer(
corpus_files: list[str],
vocab_size: int = 30000,
min_frequency: int = 2,
output_dir: str = "tokenizers",
) -> Tokenizer:
"""
Train a BPE tokenizer from scratch.
Args:
corpus_files: List of text file paths for training
vocab_size: Target vocabulary size
min_frequency: Minimum token frequency to keep
output_dir: Directory to save the trained tokenizer
"""
# Initialize with BPE model
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
# Normalization pipeline
tokenizer.normalizer = normalizers.Sequence([
normalizers.NFD(), # Unicode decomposition
normalizers.StripAccents(), # Remove diacritics
normalizers.Lowercase(), # Lowercase all text
])
# Pre-tokenization: split on whitespace and punctuation
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(
add_prefix_space=False
)
# Decoder (reverses pre-tokenization for readable output)
tokenizer.decoder = decoders.ByteLevel()
# Define special tokens
special_tokens = ["<unk>", "<s>", "</s>", "<pad>", "<mask>"]
# Configure trainer
trainer = trainers.BpeTrainer(
vocab_size=vocab_size,
min_frequency=min_frequency,
special_tokens=special_tokens,
show_progress=True,
initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
# Train on corpus files
tokenizer.train(corpus_files, trainer)
# Post-processing: add special tokens around sequences
tokenizer.post_processor = processors.TemplateProcessing(
single="<s> $A </s>",
pair="<s> $A </s> $B:1 </s>:1",
special_tokens=[
("<s>", tokenizer.token_to_id("<s>")),
("</s>", tokenizer.token_to_id("</s>")),
],
)
# Save tokenizer
output_path = Path(output_dir)
output_path.mkdir(exist_ok=True)
tokenizer.save(str(output_path / "custom-bpe.json"))
print(f"Tokenizer trained with vocab size: {tokenizer.get_vocab_size()}")
return tokenizer
def train_from_datasets(
dataset_name: str = "wikitext",
dataset_config: str = "wikitext-2-raw-v1",
vocab_size: int = 30000,
) -> Tokenizer:
"""Train a tokenizer directly from a HuggingFace dataset."""
from datasets import load_dataset
dataset = load_dataset(dataset_name, dataset_config, split="train")
# Create an iterator over the text
def batch_iterator(batch_size=1000):
for i in range(0, len(dataset), batch_size):
yield dataset[i : i + batch_size]["text"]
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
vocab_size=vocab_size,
special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
# Train from iterator (memory-efficient for large datasets)
tokenizer.train_from_iterator(batch_iterator(), trainer)
return tokenizer
def demonstrate_encoding(tokenizer: Tokenizer, texts: list[str]):
"""Show encoding results for a list of texts."""
for text in texts:
encoding = tokenizer.encode(text)
print(f"\nText: {text}")
print(f" Tokens: {encoding.tokens}")
print(f" IDs: {encoding.ids}")
print(f" Length: {len(encoding.ids)}")Tokenizer Training Pipeline:
┌─────────────────────────────────────────────────────────────────┐
│ TRAINING PIPELINE COMPONENTS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. NORMALIZER — Clean the input text │
│ "Café résumé" → "cafe resume" │
│ ┌──────────┬──────────────────────────────────────┐ │
│ │ NFD │ Unicode decomposition (split accents)│ │
│ │ Strip │ Remove accent marks │ │
│ │ Lower │ "Hello" → "hello" │ │
│ │ NFKC │ Compatibility normalization │ │
│ └──────────┴──────────────────────────────────────┘ │
│ │
│ 2. PRE-TOKENIZER — Split into initial chunks │
│ "hello world!" → ["hello", "world", "!"] │
│ ┌───────────┬─────────────────────────────────────┐ │
│ │ Whitespace│ Split on spaces │ │
│ │ ByteLevel │ UTF-8 byte representation │ │
│ │ Metaspace │ Prefix word boundaries with ▁ │ │
│ └───────────┴─────────────────────────────────────┘ │
│ │
│ 3. MODEL — Apply the tokenization algorithm │
│ ["hello"] → ["hel", "lo"] (subword split) │
│ │
│ 4. POST-PROCESSOR — Add special tokens │
│ ["hel", "lo"] → ["<s>", "hel", "lo", "</s>"] │
│ │
└─────────────────────────────────────────────────────────────────┘
Step 4: Padding and Truncation Strategies
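Before reaching for the tokenizer API, here is what `padding="longest"` plus an attention mask amounts to, in plain Python. This is a conceptual sketch using pad id 0 (BERT's actual `[PAD]` id), not the transformers implementation.

```python
def pad_batch(batch: list[list[int]], pad_id: int = 0):
    """Pad each sequence to the longest in the batch; mask 1=real, 0=padding."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask

# Two BERT-style sequences of different lengths ([CLS]=101, [SEP]=102)
ids, mask = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(ids)   # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The mask is what lets the model ignore the pad positions; the real tokenizer returns exactly this pair of tensors.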
"""Tokenizer analysis and padding/truncation strategies."""
from transformers import AutoTokenizer
from collections import Counter
def demonstrate_padding_strategies(model_name: str = "bert-base-uncased"):
"""Show different padding and truncation strategies."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
texts = [
"Short text.",
"This is a medium length sentence for testing.",
"This is a longer sentence that contains more tokens and will help "
"demonstrate how truncation works with different strategies.",
]
# Strategy 1: Pad to longest in batch
batch_longest = tokenizer(
texts,
padding="longest",
return_tensors="pt",
)
print("=== padding='longest' ===")
print(f" Shape: {batch_longest['input_ids'].shape}")
# Strategy 2: Pad to max model length
batch_max = tokenizer(
texts,
padding="max_length",
max_length=32,
truncation=True,
return_tensors="pt",
)
print(f"\n=== padding='max_length', max_length=32 ===")
print(f" Shape: {batch_max['input_ids'].shape}")
# Strategy 3: No padding (returns lists)
batch_none = tokenizer(texts, padding=False)
print(f"\n=== padding=False ===")
for i, ids in enumerate(batch_none["input_ids"]):
print(f" Text {i}: {len(ids)} tokens")
return batch_longest, batch_max, batch_none
def analyze_vocabulary(model_name: str = "bert-base-uncased"):
"""Analyze a tokenizer's vocabulary statistics."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
vocab = tokenizer.get_vocab()
# Basic stats
print(f"Model: {model_name}")
print(f"Vocab size: {len(vocab)}")
print(f"Max token length: {max(len(t) for t in vocab.keys())}")
# Token type distribution
subword_count = sum(1 for t in vocab if t.startswith("##"))
special_count = sum(1 for t in vocab if t.startswith("[") and t.endswith("]"))
regular_count = len(vocab) - subword_count - special_count
print(f"\nToken types:")
print(f" Regular words: {regular_count:>6,}")
print(f" Subwords (##): {subword_count:>6,}")
print(f" Special tokens: {special_count:>6,}")
# Character coverage
chars = set()
for token in vocab:
chars.update(token.replace("##", ""))
print(f" Unique chars: {len(chars):>6,}")
return vocab
def compare_tokenizations(text: str, model_names: list[str]):
"""Compare how different models tokenize the same text."""
print(f"Text: \"{text}\"\n")
for model_name in model_names:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
print(f" {model_name}:")
print(f" Tokens ({len(tokens)}): {tokens}")
print(f" IDs: {ids}")
print()
def token_frequency_analysis(
texts: list[str],
model_name: str = "bert-base-uncased",
) -> dict:
"""Analyze token frequency distribution in a corpus."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
token_counts = Counter()
total_tokens = 0
for text in texts:
tokens = tokenizer.tokenize(text)
token_counts.update(tokens)
total_tokens += len(tokens)
# Summary statistics
print(f"Total tokens: {total_tokens:,}")
print(f"Unique tokens: {len(token_counts):,}")
print(f"Avg tokens per text: {total_tokens / len(texts):.1f}")
print(f"\nTop 20 tokens:")
for token, count in token_counts.most_common(20):
pct = count / total_tokens * 100
print(f" {token:20s} {count:>8,} ({pct:.1f}%)")
return dict(token_counts)Padding and Truncation Explained:
┌─────────────────────────────────────────────────────────────────┐
│ PADDING & TRUNCATION STRATEGIES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Input: "Short" = 3 tokens │
│ "Medium text" = 5 tokens │
│ "A longer one" = 7 tokens │
│ │
│ padding="longest" (pad to longest in batch): │
│ ┌───┬───┬───┬───┬───┬───┬───┐ │
│ │ S │ h │ o │PAD│PAD│PAD│PAD│ 3 real + 4 padding │
│ │ M │ e │ d │ i │ u │PAD│PAD│ 5 real + 2 padding │
│ │ A │ l │ o │ n │ g │ e │ r │ 7 real + 0 padding │
│ └───┴───┴───┴───┴───┴───┴───┘ │
│ Attention mask: 1=real, 0=padding (model ignores padding) │
│ │
│ padding="max_length", max_length=5, truncation=True: │
│ ┌───┬───┬───┬───┬───┐ │
│ │ S │ h │ o │PAD│PAD│ padded to max_length │
│ │ M │ e │ d │ i │ u │ exact fit │
│ │ A │ l │ o │ n │ g │ TRUNCATED (lost "e" and "r") │
│ └───┴───┴───┴───┴───┘ │
│ │
│ When to use which: │
│ • "longest" → Dynamic batching, saves memory │
│ • "max_length" → Fixed shapes, needed for some hardware (TPU) │
│ • False → Custom collation in DataLoader │
│ │
└─────────────────────────────────────────────────────────────────┘
Step 5: Integration with Transformers
"""Integrate custom tokenizer with the transformers library."""
from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM
from tokenizers import Tokenizer
def wrap_for_transformers(
tokenizer_path: str,
model_max_length: int = 512,
) -> PreTrainedTokenizerFast:
"""
Wrap a tokenizers.Tokenizer as a transformers-compatible tokenizer.
This allows using a custom-trained tokenizer with any
transformers model, trainer, or pipeline.
"""
# Load the raw tokenizer
raw_tokenizer = Tokenizer.from_file(tokenizer_path)
# Wrap it for transformers compatibility
wrapped = PreTrainedTokenizerFast(
tokenizer_object=raw_tokenizer,
unk_token="<unk>",
bos_token="<s>",
eos_token="</s>",
pad_token="<pad>",
mask_token="<mask>",
model_max_length=model_max_length,
)
return wrapped
def compare_fast_vs_slow():
"""
Compare fast (Rust) vs slow (Python) tokenizers.
Fast tokenizers (PreTrainedTokenizerFast):
- Written in Rust via the `tokenizers` library
- 10-20x faster for batch encoding
- Support offset mapping (char-to-token alignment)
- Return BatchEncoding with additional methods
Slow tokenizers (PreTrainedTokenizer):
- Written in pure Python
- Easier to debug and modify
- Same API but fewer features
"""
from transformers import BertTokenizer, BertTokenizerFast
import time
texts = ["This is a test sentence."] * 1000
# Slow tokenizer
slow = BertTokenizer.from_pretrained("bert-base-uncased")
start = time.perf_counter()
slow(texts, padding=True, truncation=True)
slow_time = time.perf_counter() - start
# Fast tokenizer
fast = BertTokenizerFast.from_pretrained("bert-base-uncased")
start = time.perf_counter()
fast(texts, padding=True, truncation=True)
fast_time = time.perf_counter() - start
print(f"Slow tokenizer: {slow_time:.3f}s")
print(f"Fast tokenizer: {fast_time:.3f}s")
print(f"Speedup: {slow_time / fast_time:.1f}x")
def offset_mapping_demo():
"""
Demonstrate offset mappings for character-level alignment.
Offset mappings tell you which characters in the original text
correspond to each token — essential for NER and span extraction.
"""
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "HuggingFace is creating amazing NLP tools"
encoding = tokenizer(
text,
return_offsets_mapping=True,
)
print(f"Text: {text}\n")
print(f"{'Token':<15} {'ID':<8} {'Start':<6} {'End':<6} {'Original'}")
print("-" * 55)
for token, token_id, (start, end) in zip(
encoding.tokens(),
encoding["input_ids"],
encoding["offset_mapping"],
):
original = text[start:end] if start != end else "[special]"
print(f"{token:<15} {token_id:<8} {start:<6} {end:<6} {original}")Fast vs Slow Tokenizers:
| Feature | Fast (Rust) | Slow (Python) |
|---|---|---|
| Speed | 10-20x faster | Baseline |
| Offset mapping | Yes | No |
| Batch encoding | Parallelized | Sequential |
| Custom modification | Harder | Easier |
| Library | tokenizers | Pure Python |
| Default in transformers | Yes (since v4.0) | Legacy |
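Offset mappings, the fast-only feature in the table above, are simply `(start, end)` character spans; recovering the source text for each token is a plain slice. The offsets below are illustrative values, not actual model output.

```python
def spans_from_offsets(text: str, offsets: list[tuple[int, int]]) -> list[str]:
    """Map each token's (start, end) offsets back to the source text."""
    # Special tokens like [CLS]/[SEP] conventionally carry the empty span (0, 0)
    return [text[s:e] if (s, e) != (0, 0) else "[special]" for s, e in offsets]

text = "HuggingFace is creating tools"
# Hypothetical offsets for: [CLS] Hugging Face is creating tools [SEP]
offsets = [(0, 0), (0, 7), (7, 11), (12, 14), (15, 23), (24, 29), (0, 0)]
print(spans_from_offsets(text, offsets))
# ['[special]', 'Hugging', 'Face', 'is', 'creating', 'tools', '[special]']
```

This is the mechanism behind `offset_mapping_demo` above, and why slow tokenizers (which cannot return offsets) are a poor fit for NER and span-extraction tasks.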
Step 6: Comparison Demo
"""Compare tokenization algorithms side by side."""
from src.algorithms import explain_bpe, explain_wordpiece, explain_unigram
from src.analysis import compare_tokenizations
def main():
# Compare how different models tokenize the same text
compare_tokenizations(
text="The quick brown fox jumped over the lazy dog.",
model_names=[
"bert-base-uncased", # WordPiece
"gpt2", # BPE (byte-level)
"t5-small", # Unigram (SentencePiece)
],
)
# Interesting edge cases
print("=== Edge Cases ===\n")
compare_tokenizations(
text="transformers is 10x faster than tensorflow",
model_names=["bert-base-uncased", "gpt2"],
)
compare_tokenizations(
text="pneumonoultramicroscopicsilicovolcanoconiosis",
model_names=["bert-base-uncased", "gpt2"],
)
if __name__ == "__main__":
main()Running the Project
# Install dependencies
pip install -r requirements.txt
# Train a custom BPE tokenizer
python -c "
from src.train_tokenizer import train_from_datasets
tok = train_from_datasets(vocab_size=10000)
tok.save('tokenizers/custom-bpe.json')
"
# Compare algorithms
python examples/compare_algorithms.py
# Analyze BERT vocabulary
python -c "
from src.analysis import analyze_vocabulary
analyze_vocabulary('bert-base-uncased')
"
# Fast vs slow benchmark
python -c "
from src.integration import compare_fast_vs_slow
compare_fast_vs_slow()
"Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| BPE | Byte-Pair Encoding — merge most frequent pairs | Standard for modern LLMs (GPT, LLaMA) |
| WordPiece | Merge pairs that maximize likelihood | Used by BERT family; ## prefix for subwords |
| Unigram | Remove tokens that least reduce likelihood | Used by T5, ALBERT; ▁ prefix for word boundaries |
| Normalizer | Clean text before tokenization | Consistent handling of unicode, case, accents |
| Pre-tokenizer | Initial text splitting | Controls word boundary detection |
| Offset Mapping | Character-to-token alignment | Essential for NER, span extraction, highlighting |
| Fast Tokenizer | Rust-backed tokenizer | 10-20x faster than Python; supports offsets |
| Special Tokens | [CLS], [SEP], [PAD], [MASK] | Control model behavior for different tasks |
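To tie the recap together, here is a minimal end-to-end run with the tokenizers library: train a BPE tokenizer on a tiny in-memory corpus, then encode unseen text. The three-line corpus and `vocab_size=60` are throwaway values for illustration.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Tiny in-memory corpus; real training would use files or a dataset iterator
corpus = ["low lower lowest", "new newer newest", "wide wider widest"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=60, special_tokens=["[UNK]"], show_progress=False
)
tokenizer.train_from_iterator(corpus, trainer)

enc = tokenizer.encode("newest lowest")
print(enc.tokens)  # learned subword split of the two words
print(enc.ids)
```

Even with this toy corpus, the trained tokenizer covers "lowest" (never seen as a whole word) without falling back to `[UNK]`, because every character and the learned merges are in the vocabulary.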
Next Steps
- Datasets Mastery — Load and process training data at scale
- Fine-Tuning with PEFT — Fine-tune models with your custom tokenizer