Tokenizers Deep Dive
Train custom tokenizers and understand BPE, WordPiece, and Unigram algorithms
TL;DR
Tokenizers convert raw text into numerical token IDs that models can process. Learn the three main algorithms (BPE, WordPiece, Unigram), train a custom tokenizer from scratch using the tokenizers library, and understand padding, truncation, and special tokens.
What You'll Learn
- BPE, WordPiece, and Unigram tokenization algorithms
- Training custom tokenizers from scratch
- Fast vs slow tokenizers and performance differences
- Padding, truncation, and attention mask strategies
- Vocabulary analysis and token distribution
- Integrating custom tokenizers with transformers
Why Tokenization Is the Foundation
Every language model -- GPT-4, Llama, BERT -- sees text as a sequence of integer token IDs, not characters or words. Tokenization is the process that converts raw text into those IDs, and it directly affects model quality, speed, and cost. A poorly chosen vocabulary wastes context window on fragmented tokens (one word split into 5+ pieces), while a well-trained tokenizer keeps common phrases intact. Understanding tokenization is also essential for debugging: unexpected model behavior often traces back to how input was tokenized (e.g., numbers, code, non-English text).
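As a quick illustration -- a minimal sketch assuming `transformers` is installed and the `gpt2` tokenizer can be downloaded from the Hub -- common words survive as single tokens while rare words fragment into many pieces:

```python
"""Show token fragmentation: common vs rare words (gpt2 BPE)."""
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["the cat sat on the mat", "antidisestablishmentarianism"]:
    tokens = tokenizer.tokenize(text)
    # Frequent English words map to single tokens; rare words split
    # into several subword pieces, consuming more context window.
    print(f"{text!r}: {len(tokens)} tokens -> {tokens}")
```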
| Property | Value |
|---|---|
| Difficulty | Beginner |
| Time | ~3 hours |
| Lines of Code | ~250 |
| Prerequisites | Basic Python, familiarity with NLP concepts |
Tech Stack
| Component | Technology | Why |
|---|---|---|
| Tokenizer Training | tokenizers (Rust-backed) | 10-20x faster than pure Python implementations |
| Integration | transformers | Seamless use of custom tokenizers with any model |
| Analysis | matplotlib, collections | Vocabulary statistics and token distribution visualization |
| Python | 3.10+ | Type hint support |
Architecture
Tokenizer Training Pipeline
Inference: text → normalize → pre-tokenize → model → post-process
Example: "Hello world!" → ["hello", "world", "!"] → [101, 7592, 2088, 999, 102]
Project Structure
tokenizers-deep-dive/
├── src/
│ ├── __init__.py
│ ├── algorithms.py # BPE, WordPiece, Unigram implementations
│ ├── train_tokenizer.py # Train custom tokenizer from scratch
│ ├── analysis.py # Vocabulary analysis and statistics
│ └── integration.py # Use custom tokenizer with transformers
├── data/
│ └── corpus.txt # Training corpus
├── tokenizers/ # Saved tokenizer files
├── examples/
│ └── compare_algorithms.py
├── requirements.txt
└── README.md
Implementation
Step 1: Dependencies
tokenizers>=0.19.0
transformers>=4.40.0
matplotlib>=3.8.0
datasets>=2.19.0
Step 2: Understanding the Algorithms
"""Demonstrate tokenization algorithms with educational examples."""
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, normalizers
def explain_bpe():
"""
Byte-Pair Encoding (BPE):
Starts with character-level tokens and iteratively merges
the most frequent adjacent pair into a new token.
Used by: GPT-2, GPT-3, GPT-4, LLaMA, Mistral
"""
# Create a BPE tokenizer
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Normalization: lowercase, strip accents
tokenizer.normalizer = normalizers.Sequence([
normalizers.NFD(),
normalizers.Lowercase(),
normalizers.StripAccents(),
])
# Pre-tokenization: split on whitespace
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
# BPE trainer configuration
trainer = trainers.BpeTrainer(
vocab_size=1000,
min_frequency=2,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
show_progress=True,
)
return tokenizer, trainer
def explain_wordpiece():
"""
WordPiece:
Similar to BPE but uses likelihood maximization instead of
frequency. Merges the pair that maximizes the language model
likelihood of the training data.
The ## prefix indicates a subword continuation.
Used by: BERT, DistilBERT, Electra
"""
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([
normalizers.NFD(),
normalizers.Lowercase(),
normalizers.StripAccents(),
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
vocab_size=1000,
min_frequency=2,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
continuing_subword_prefix="##",
)
return tokenizer, trainer
def explain_unigram():
"""
Unigram (SentencePiece):
Starts with a large vocabulary and iteratively removes tokens
that least decrease the overall likelihood. Uses the EM algorithm
to find the optimal tokenization.
The ▁ prefix indicates a word boundary.
Used by: T5, ALBERT, XLNet, mBART
"""
tokenizer = Tokenizer(models.Unigram())
tokenizer.normalizer = normalizers.Sequence([
normalizers.Nmt(),
normalizers.NFKC(),
])
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
trainer = trainers.UnigramTrainer(
vocab_size=1000,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
unk_token="[UNK]",
)
    return tokenizer, trainer
How BPE Merging Works:
BPE Merge Process (simplified)
Step 0: Character level
Corpus: ["low", "lower", "newest", "widest"] → l o w l o w e r n e w e s t w i d e s t
Step 1: Merge (e, s)
Most frequent pair → merge into "es": l o w l o w e r n e w es t w i d es t
Step 2: Merge (es, t)
Most frequent pair → merge into "est": l o w l o w e r n e w est w i d est
Step 3: Merge (l, o)
Most frequent pair → merge into "lo": lo w lo w e r n e w est w i d est
Step 4: Merge (lo, w)
Most frequent pair → merge into "low": low low e r n e w est w i d est
Final vocabulary includes both characters and merged tokens.
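The merge loop is simple enough to sketch in pure Python. The toy below is educational only (real trainers are heavily optimized Rust); it uses the word frequencies from the original BPE paper (low×5, lower×2, newest×6, widest×3) so the merge order reproduces the steps above:

```python
"""Toy BPE merge loop -- educational sketch, not the real trainer."""
from collections import Counter

# Word frequencies from the original BPE paper (Sennrich et al., 2016),
# chosen so the merge order matches the walkthrough above.
corpus = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w", "e", "s", "t"): 6,
    ("w", "i", "d", "e", "s", "t"): 3,
}

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for step in range(1, 5):
    pair = most_frequent_pair(corpus)
    corpus = merge(corpus, pair)
    print(f"Step {step}: merge {pair} -> {''.join(pair)}")
# Step 1: merge ('e', 's') -> es
# Step 2: merge ('es', 't') -> est
# Step 3: merge ('l', 'o') -> lo
# Step 4: merge ('lo', 'w') -> low
```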
Algorithm Comparison:
| Algorithm | Merge Strategy | Subword Prefix | Used By |
|---|---|---|---|
| BPE | Most frequent pair | None (byte-level) | GPT-2/3/4, LLaMA, Mistral |
| WordPiece | Max likelihood pair | ## continuation | BERT, DistilBERT |
| Unigram | Remove least useful | ▁ word boundary | T5, ALBERT, XLNet |
Step 3: Train a Custom Tokenizer
"""Train a custom tokenizer from scratch."""
from tokenizers import (
Tokenizer,
models,
trainers,
pre_tokenizers,
normalizers,
processors,
decoders,
)
from pathlib import Path
def train_bpe_tokenizer(
corpus_files: list[str],
vocab_size: int = 30000,
min_frequency: int = 2,
output_dir: str = "tokenizers",
) -> Tokenizer:
"""
Train a BPE tokenizer from scratch.
Args:
corpus_files: List of text file paths for training
vocab_size: Target vocabulary size
min_frequency: Minimum token frequency to keep
output_dir: Directory to save the trained tokenizer
"""
# Initialize with BPE model
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
# Normalization pipeline
tokenizer.normalizer = normalizers.Sequence([
normalizers.NFD(), # Unicode decomposition
normalizers.StripAccents(), # Remove diacritics
normalizers.Lowercase(), # Lowercase all text
])
# Pre-tokenization: split on whitespace and punctuation
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(
add_prefix_space=False
)
# Decoder (reverses pre-tokenization for readable output)
tokenizer.decoder = decoders.ByteLevel()
# Define special tokens
special_tokens = ["<unk>", "<s>", "</s>", "<pad>", "<mask>"]
# Configure trainer
trainer = trainers.BpeTrainer(
vocab_size=vocab_size,
min_frequency=min_frequency,
special_tokens=special_tokens,
show_progress=True,
initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
# Train on corpus files
tokenizer.train(corpus_files, trainer)
# Post-processing: add special tokens around sequences
tokenizer.post_processor = processors.TemplateProcessing(
single="<s> $A </s>",
pair="<s> $A </s> $B:1 </s>:1",
special_tokens=[
("<s>", tokenizer.token_to_id("<s>")),
("</s>", tokenizer.token_to_id("</s>")),
],
)
# Save tokenizer
output_path = Path(output_dir)
output_path.mkdir(exist_ok=True)
tokenizer.save(str(output_path / "custom-bpe.json"))
print(f"Tokenizer trained with vocab size: {tokenizer.get_vocab_size()}")
return tokenizer
def train_from_datasets(
dataset_name: str = "wikitext",
dataset_config: str = "wikitext-2-raw-v1",
vocab_size: int = 30000,
) -> Tokenizer:
"""Train a tokenizer directly from a HuggingFace dataset."""
from datasets import load_dataset
dataset = load_dataset(dataset_name, dataset_config, split="train")
# Create an iterator over the text
def batch_iterator(batch_size=1000):
for i in range(0, len(dataset), batch_size):
yield dataset[i : i + batch_size]["text"]
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
vocab_size=vocab_size,
special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
# Train from iterator (memory-efficient for large datasets)
tokenizer.train_from_iterator(batch_iterator(), trainer)
return tokenizer
def demonstrate_encoding(tokenizer: Tokenizer, texts: list[str]):
"""Show encoding results for a list of texts."""
for text in texts:
encoding = tokenizer.encode(text)
print(f"\nText: {text}")
print(f" Tokens: {encoding.tokens}")
print(f" IDs: {encoding.ids}")
print(f" Length: {len(encoding.ids)}")Understanding the Training Code:
The train_bpe_tokenizer function assembles a full tokenizer from four composable stages. First, the normalizer chain applies NFD Unicode decomposition, strips accents, and lowercases -- this ensures that "Café" and "cafe" map to the same tokens. The ByteLevel pre-tokenizer then maps every byte to a printable Unicode character, so no input is ever out-of-vocabulary, even binary data. The BpeTrainer iteratively merges the most frequent pairs until the vocabulary reaches vocab_size. Finally, TemplateProcessing wraps every encoded sequence in <s> and </s>, the format RoBERTa-style models expect. The train_from_datasets variant shows a memory-efficient pattern: instead of writing text to files first, it feeds a Python iterator directly to train_from_iterator().
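To sanity-check the result, the sketch below reloads the saved file and encodes an accented string (it assumes train_bpe_tokenizer has already written tokenizers/custom-bpe.json; the exact subword splits depend on your training corpus):

```python
"""Reload the trained tokenizer and verify its pipeline end to end."""
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizers/custom-bpe.json")

encoding = tokenizer.encode("Café Crème")
# Normalization lowercases and strips accents; TemplateProcessing
# wraps the sequence: e.g. ['<s>', 'cafe', ..., '</s>'].
print(encoding.tokens)
print(encoding.ids)
print(tokenizer.decode(encoding.ids))  # special tokens skipped by default
```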
Tokenizer Training Pipeline: corpus files → normalizer → pre-tokenizer → BPE trainer (iterative merges) → post-processor → decoder → saved JSON file
Step 4: Padding and Truncation Strategies
"""Tokenizer analysis and padding/truncation strategies."""
from transformers import AutoTokenizer
from collections import Counter
def demonstrate_padding_strategies(model_name: str = "bert-base-uncased"):
"""Show different padding and truncation strategies."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
texts = [
"Short text.",
"This is a medium length sentence for testing.",
"This is a longer sentence that contains more tokens and will help "
"demonstrate how truncation works with different strategies.",
]
# Strategy 1: Pad to longest in batch
batch_longest = tokenizer(
texts,
padding="longest",
return_tensors="pt",
)
print("=== padding='longest' ===")
print(f" Shape: {batch_longest['input_ids'].shape}")
# Strategy 2: Pad to max model length
batch_max = tokenizer(
texts,
padding="max_length",
max_length=32,
truncation=True,
return_tensors="pt",
)
print(f"\n=== padding='max_length', max_length=32 ===")
print(f" Shape: {batch_max['input_ids'].shape}")
# Strategy 3: No padding (returns lists)
batch_none = tokenizer(texts, padding=False)
print(f"\n=== padding=False ===")
for i, ids in enumerate(batch_none["input_ids"]):
print(f" Text {i}: {len(ids)} tokens")
return batch_longest, batch_max, batch_none
def analyze_vocabulary(model_name: str = "bert-base-uncased"):
"""Analyze a tokenizer's vocabulary statistics."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
vocab = tokenizer.get_vocab()
# Basic stats
print(f"Model: {model_name}")
print(f"Vocab size: {len(vocab)}")
print(f"Max token length: {max(len(t) for t in vocab.keys())}")
# Token type distribution
subword_count = sum(1 for t in vocab if t.startswith("##"))
special_count = sum(1 for t in vocab if t.startswith("[") and t.endswith("]"))
regular_count = len(vocab) - subword_count - special_count
print(f"\nToken types:")
print(f" Regular words: {regular_count:>6,}")
print(f" Subwords (##): {subword_count:>6,}")
print(f" Special tokens: {special_count:>6,}")
# Character coverage
chars = set()
for token in vocab:
chars.update(token.replace("##", ""))
print(f" Unique chars: {len(chars):>6,}")
return vocab
def compare_tokenizations(text: str, model_names: list[str]):
"""Compare how different models tokenize the same text."""
print(f"Text: \"{text}\"\n")
for model_name in model_names:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
print(f" {model_name}:")
print(f" Tokens ({len(tokens)}): {tokens}")
print(f" IDs: {ids}")
print()
def token_frequency_analysis(
texts: list[str],
model_name: str = "bert-base-uncased",
) -> dict:
"""Analyze token frequency distribution in a corpus."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
token_counts = Counter()
total_tokens = 0
for text in texts:
tokens = tokenizer.tokenize(text)
token_counts.update(tokens)
total_tokens += len(tokens)
# Summary statistics
print(f"Total tokens: {total_tokens:,}")
print(f"Unique tokens: {len(token_counts):,}")
print(f"Avg tokens per text: {total_tokens / len(texts):.1f}")
print(f"\nTop 20 tokens:")
for token, count in token_counts.most_common(20):
pct = count / total_tokens * 100
print(f" {token:20s} {count:>8,} ({pct:.1f}%)")
    return dict(token_counts)
Padding and Truncation Explained:
Padding & Truncation Strategies:
| Strategy | Behavior | Typical Use |
|---|---|---|
| padding="longest" (recommended) | Pad each batch to its longest sequence | Most training and batch inference; least wasted compute |
| padding="max_length" | Pad every sequence to a fixed max_length (pair with truncation=True) | Static-shape backends such as ONNX export or TPUs |
| padding=False | No padding; returns variable-length lists | Preprocessing before a dynamic-padding collator |
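Whatever the padding strategy, the tokenizer also returns an attention_mask so the model can ignore pad positions. A minimal sketch (assumes transformers and torch are installed):

```python
"""Attention masks mark real tokens (1) vs padding (0)."""
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(
    ["Short text.", "A noticeably longer sentence for the same batch."],
    padding="longest",
    truncation=True,
    max_length=16,
    return_tensors="pt",
)
print(batch["input_ids"])       # padded IDs, shape (2, seq_len)
print(batch["attention_mask"])  # 1 for real tokens, 0 for [PAD] positions
```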
Step 5: Integration with Transformers
"""Integrate custom tokenizer with the transformers library."""
from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer
def wrap_for_transformers(
tokenizer_path: str,
model_max_length: int = 512,
) -> PreTrainedTokenizerFast:
"""
Wrap a tokenizers.Tokenizer as a transformers-compatible tokenizer.
This allows using a custom-trained tokenizer with any
transformers model, trainer, or pipeline.
"""
# Load the raw tokenizer
raw_tokenizer = Tokenizer.from_file(tokenizer_path)
# Wrap it for transformers compatibility
wrapped = PreTrainedTokenizerFast(
tokenizer_object=raw_tokenizer,
unk_token="<unk>",
bos_token="<s>",
eos_token="</s>",
pad_token="<pad>",
mask_token="<mask>",
model_max_length=model_max_length,
)
return wrapped
def compare_fast_vs_slow():
"""
Compare fast (Rust) vs slow (Python) tokenizers.
Fast tokenizers (PreTrainedTokenizerFast):
- Written in Rust via the `tokenizers` library
- 10-20x faster for batch encoding
- Support offset mapping (char-to-token alignment)
- Return BatchEncoding with additional methods
Slow tokenizers (PreTrainedTokenizer):
- Written in pure Python
- Easier to debug and modify
- Same API but fewer features
"""
from transformers import BertTokenizer, BertTokenizerFast
import time
texts = ["This is a test sentence."] * 1000
# Slow tokenizer
slow = BertTokenizer.from_pretrained("bert-base-uncased")
start = time.perf_counter()
slow(texts, padding=True, truncation=True)
slow_time = time.perf_counter() - start
# Fast tokenizer
fast = BertTokenizerFast.from_pretrained("bert-base-uncased")
start = time.perf_counter()
fast(texts, padding=True, truncation=True)
fast_time = time.perf_counter() - start
print(f"Slow tokenizer: {slow_time:.3f}s")
print(f"Fast tokenizer: {fast_time:.3f}s")
print(f"Speedup: {slow_time / fast_time:.1f}x")
def offset_mapping_demo():
"""
Demonstrate offset mappings for character-level alignment.
Offset mappings tell you which characters in the original text
correspond to each token — essential for NER and span extraction.
"""
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "HuggingFace is creating amazing NLP tools"
encoding = tokenizer(
text,
return_offsets_mapping=True,
)
print(f"Text: {text}\n")
print(f"{'Token':<15} {'ID':<8} {'Start':<6} {'End':<6} {'Original'}")
print("-" * 55)
for token, token_id, (start, end) in zip(
encoding.tokens(),
encoding["input_ids"],
encoding["offset_mapping"],
):
original = text[start:end] if start != end else "[special]"
print(f"{token:<15} {token_id:<8} {start:<6} {end:<6} {original}")Fast vs Slow Tokenizers:
| Feature | Fast (Rust) | Slow (Python) |
|---|---|---|
| Speed | 10-20x faster | Baseline |
| Offset mapping | Yes | No |
| Batch encoding | Parallelized | Sequential |
| Custom modification | Harder | Easier |
| Library | tokenizers | Pure Python |
| Default in transformers | Yes (since v4.0) | Legacy |
Step 6: Comparison Demo
"""Compare tokenization algorithms side by side."""
from src.analysis import compare_tokenizations
def main():
# Compare how different models tokenize the same text
compare_tokenizations(
text="The quick brown fox jumped over the lazy dog.",
model_names=[
"bert-base-uncased", # WordPiece
"gpt2", # BPE (byte-level)
"t5-small", # Unigram (SentencePiece)
],
)
# Interesting edge cases
print("=== Edge Cases ===\n")
compare_tokenizations(
text="transformers is 10x faster than tensorflow",
model_names=["bert-base-uncased", "gpt2"],
)
compare_tokenizations(
text="pneumonoultramicroscopicsilicovolcanoconiosis",
model_names=["bert-base-uncased", "gpt2"],
)
if __name__ == "__main__":
    main()
Running the Project
# Install dependencies
pip install -r requirements.txt
# Train a custom BPE tokenizer
python -c "
from src.train_tokenizer import train_from_datasets
tok = train_from_datasets(vocab_size=10000)
tok.save('tokenizers/custom-bpe.json')
"
# Compare algorithms
python examples/compare_algorithms.py
# Analyze BERT vocabulary
python -c "
from src.analysis import analyze_vocabulary
analyze_vocabulary('bert-base-uncased')
"
# Fast vs slow benchmark
python -c "
from src.integration import compare_fast_vs_slow
compare_fast_vs_slow()
"Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| BPE | Byte-Pair Encoding — merge most frequent pairs | Standard for modern LLMs (GPT, LLaMA) |
| WordPiece | Merge pairs that maximize likelihood | Used by BERT family; ## prefix for subwords |
| Unigram | Remove tokens that least reduce likelihood | Used by T5, ALBERT; ▁ prefix for word boundaries |
| Normalizer | Clean text before tokenization | Consistent handling of unicode, case, accents |
| Pre-tokenizer | Initial text splitting | Controls word boundary detection |
| Offset Mapping | Character-to-token alignment | Essential for NER, span extraction, highlighting |
| Fast Tokenizer | Rust-backed tokenizer | 10-20x faster than Python; supports offsets |
| Special Tokens | [CLS], [SEP], [PAD], [MASK] | Control model behavior for different tasks |
Next Steps
- Datasets Mastery — Load and process training data at scale
- Fine-Tuning with PEFT — Fine-tune models with your custom tokenizer