Datasets Mastery
Load, stream, transform, and publish datasets with HuggingFace datasets library
TL;DR
The datasets library provides memory-efficient loading, streaming, and transformation of any dataset — from Hub-hosted collections to custom CSVs. Learn load_dataset, map/filter/select operations, streaming for large datasets, and pushing your own datasets to the Hub.
Master the HuggingFace datasets library for loading, transforming, and publishing datasets at any scale.
What You'll Learn
- Loading datasets from the Hub, local files, and custom scripts
- Streaming mode for datasets that don't fit in memory
- map, filter, select, and sort transformations
- Building preprocessing pipelines with batched operations
- Creating and publishing custom datasets to the Hub
- Memory-efficient data handling with Apache Arrow
Tech Stack
| Component | Technology |
|---|---|
| Data Loading | datasets |
| Hub Integration | huggingface_hub |
| Tokenization | transformers |
| Python | 3.10+ |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ DATASETS WORKFLOW │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ DATA SOURCES PROCESSING │
│ ┌────────────────┐ ┌────────────────────────────────┐ │
│ │ HuggingFace Hub│──┐ │ Apache Arrow In-Memory Format │ │
│ │ (200K+ datasets)│ │ │ │ │
│ └────────────────┘ │ load_dataset() │ ┌──────┐ ┌──────┐ ┌────────┐ │ │
│ ┌────────────────┐ ├────────────────▶│ │ map │►│filter│►│ select │ │ │
│ │ Local Files │──┤ │ └──────┘ └──────┘ └────────┘ │ │
│ │ (CSV/JSON/ │ │ │ │ │
│ │ Parquet/Text) │ │ │ Features: │ │
│ └────────────────┘ │ │ • Zero-copy reads │ │
│ ┌────────────────┐ │ │ • Memory-mapped files │ │
│ │ Custom Script │──┘ │ • Batched processing │ │
│ └────────────────┘ └───────────────┬────────────────┘ │
│ │ │
│ STREAMING (for large datasets) ▼ │
│ ┌────────────────────────────────┐ ┌────────────────────────────────┐ │
│ │ load_dataset(..., streaming=True)│ │ Push to Hub / Save to Disk │ │
│ │ Returns IterableDataset │ │ push_to_hub() / save_to_disk()│ │
│ │ • No download required │ └────────────────────────────────┘ │
│ │ • Process one example at a time │ │
│ └────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Project Structure
datasets-mastery/
├── src/
│ ├── __init__.py
│ ├── loading.py # Dataset loading patterns
│ ├── transforms.py # Map, filter, and preprocessing
│ ├── streaming.py # Streaming for large datasets
│ ├── custom_dataset.py # Create and publish custom datasets
│ └── preprocessing.py # NLP preprocessing pipelines
├── data/
│ ├── sample.csv
│ └── sample.jsonl
├── examples/
│ └── full_pipeline.py
├── requirements.txt
└── README.md
Implementation
Step 1: Dependencies
datasets>=2.19.0
huggingface_hub>=0.23.0
transformers>=4.40.0
pandas>=2.0.0
Step 2: Loading Datasets
"""Dataset loading patterns for different sources."""
from datasets import (
load_dataset,
load_from_disk,
Dataset,
DatasetDict,
Features,
Value,
ClassLabel,
)
def load_from_hub():
"""Load popular datasets from HuggingFace Hub."""
# Load full dataset with all splits
imdb = load_dataset("imdb")
print(f"IMDB splits: {list(imdb.keys())}")
print(f"Train size: {len(imdb['train']):,}")
print(f"Features: {imdb['train'].features}")
# Load a specific split
train = load_dataset("imdb", split="train")
print(f"\nSingle split shape: {train.shape}")
# Load a subset (first 1000 examples)
subset = load_dataset("imdb", split="train[:1000]")
print(f"Subset size: {len(subset)}")
# Load specific columns only
texts_only = load_dataset(
"imdb",
split="train[:100]",
).select_columns(["text"])
print(f"Columns: {texts_only.column_names}")
return imdb
def load_from_local_files():
"""Load datasets from local CSV, JSON, and text files."""
# From CSV
csv_dataset = load_dataset("csv", data_files="data/sample.csv")
print(f"CSV dataset: {csv_dataset}")
# From JSON Lines
jsonl_dataset = load_dataset("json", data_files="data/sample.jsonl")
print(f"JSONL dataset: {jsonl_dataset}")
# From multiple files with train/test split
split_dataset = load_dataset(
"csv",
data_files={
"train": "data/train.csv",
"test": "data/test.csv",
},
)
# From text files (one example per line)
text_dataset = load_dataset(
"text",
data_files="data/corpus.txt",
)
return csv_dataset
def create_from_dict():
    """Create a dataset from Python dictionaries."""
    # Simple creation from dict
    dataset = Dataset.from_dict({
        "text": [
            "Great product!",
            "Terrible service.",
            "Average quality.",
        ],
        "label": [1, 0, 1],
    })
    print(f"From dict: {dataset}")

    # With explicit features/schema
    features = Features({
        "text": Value("string"),
        "label": ClassLabel(names=["negative", "positive"]),
        "score": Value("float32"),
    })
    typed_dataset = Dataset.from_dict(
        {
            "text": ["Hello", "World"],
            "label": [0, 1],
            "score": [0.3, 0.9],
        },
        features=features,
    )
    print(f"With features: {typed_dataset.features}")
    return dataset


def create_from_pandas():
    """Create a dataset from a pandas DataFrame."""
    import pandas as pd

    df = pd.DataFrame({
        "question": ["What is Python?", "What is ML?"],
        "answer": ["A programming language", "Machine learning"],
        "category": ["programming", "ai"],
    })
    dataset = Dataset.from_pandas(df)
    print(f"From pandas: {dataset}")
    return dataset
Understanding the Arrow Backend:
┌─────────────────────────────────────────────────────────────────┐
│ WHY APACHE ARROW? │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Traditional Python (pandas): │
│ ┌─────────────────────────────┐ │
│ │ DataFrame in RAM │ All data loaded into memory │
│ │ 10GB dataset = 10GB+ RAM │ Copies on every operation │
│ └─────────────────────────────┘ │
│ │
│ HuggingFace datasets (Arrow): │
│ ┌─────────────────────────────┐ │
│ │ Memory-mapped Arrow file │ Data stays on disk │
│ │ 10GB dataset ≈ 0 RAM usage │ OS pages data in as needed │
│ │ Zero-copy column access │ No deserialization overhead │
│ └─────────────────────────────┘ │
│ │
│ Practical impact: │
│ • Load The Pile (800GB) on a laptop with 16GB RAM │
│ • Instant random access to any example │
│ • Column operations without loading unused columns │
│ • Cached transformations (map results saved to disk) │
│ │
└─────────────────────────────────────────────────────────────────┘
Step 3: Transformations
"""Dataset transformations: map, filter, select, sort."""
from datasets import Dataset
def map_examples(dataset: Dataset) -> Dataset:
"""Apply transformations with map()."""
# Single example map — add a new column
def add_word_count(example):
example["word_count"] = len(example["text"].split())
return example
dataset = dataset.map(add_word_count)
print(f"Columns after map: {dataset.column_names}")
return dataset
def batched_map(dataset: Dataset) -> Dataset:
"""
Batched map for higher throughput.
Processing examples in batches is 10-100x faster than
one-at-a-time, especially for tokenization.
"""
def tokenize_batch(batch):
"""Process a batch of examples at once."""
# batch["text"] is a list of strings
# Return a dict of lists (same length as input)
batch["text_lower"] = [t.lower() for t in batch["text"]]
batch["char_count"] = [len(t) for t in batch["text"]]
return batch
dataset = dataset.map(
tokenize_batch,
batched=True,
batch_size=1000,
num_proc=4, # Parallelize across 4 CPU cores
)
return dataset
def filter_examples(dataset: Dataset) -> Dataset:
"""Filter dataset based on conditions."""
# Keep only examples with more than 10 words
filtered = dataset.filter(
lambda x: len(x["text"].split()) > 10
)
print(f"Before filter: {len(dataset)}, After: {len(filtered)}")
# Batched filter (faster)
filtered_batch = dataset.filter(
lambda batch: [len(t.split()) > 10 for t in batch["text"]],
batched=True,
)
return filtered
def select_and_sort(dataset: Dataset) -> Dataset:
"""Select specific examples and sort."""
# Select by indices
subset = dataset.select(range(100))
print(f"Selected 100: {len(subset)}")
# Random shuffle
shuffled = dataset.shuffle(seed=42)
# Sort by a column
if "word_count" in dataset.column_names:
sorted_ds = dataset.sort("word_count", reverse=True)
print(f"Longest text: {sorted_ds[0]['word_count']} words")
# Train/test split
splits = dataset.train_test_split(test_size=0.2, seed=42)
print(f"Train: {len(splits['train'])}, Test: {len(splits['test'])}")
return splits
def rename_and_remove(dataset: Dataset) -> Dataset:
"""Rename and remove columns."""
# Rename columns
dataset = dataset.rename_column("text", "input_text")
# Remove columns
dataset = dataset.remove_columns(["word_count"])
# Cast column types
dataset = dataset.cast_column("label", "int32")
return datasetMap Operations Compared:
| Mode | Speed | Memory | Use Case |
|---|---|---|---|
| map(fn) | Slow | Low per-example | Simple per-example transforms |
| map(fn, batched=True) | Fast | Moderate | Tokenization, batch processing |
| map(fn, batched=True, num_proc=4) | Fastest | Higher | CPU-bound transforms |
Step 4: Streaming for Large Datasets
"""Streaming datasets for data too large to fit in memory."""
from datasets import load_dataset
from itertools import islice
def stream_large_dataset():
"""
Load a dataset in streaming mode.
Streaming downloads and processes data on-the-fly,
never loading the full dataset into memory.
"""
# Stream C4 (305GB compressed, ~750GB uncompressed)
stream = load_dataset(
"allenai/c4",
"en",
split="train",
streaming=True,
)
# Peek at first 5 examples
for i, example in enumerate(stream):
print(f" Example {i}: {example['text'][:80]}...")
if i >= 4:
break
def stream_with_transforms():
"""Apply transforms to a streaming dataset."""
stream = load_dataset(
"imdb",
split="train",
streaming=True,
)
# Map transforms work on streams
processed = stream.map(
lambda x: {"text_length": len(x["text"])}
)
# Filter works on streams
long_reviews = processed.filter(
lambda x: x["text_length"] > 1000
)
# Take first N examples
samples = list(islice(long_reviews, 10))
print(f"Got {len(samples)} long reviews")
return samples
def stream_and_tokenize():
    """Tokenize a streaming dataset for training."""
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    stream = load_dataset(
        "wikitext",
        "wikitext-2-raw-v1",
        split="train",
        streaming=True,
    )

    def tokenize(example):
        return tokenizer(
            example["text"],
            truncation=True,
            max_length=512,
            padding="max_length",
        )

    tokenized_stream = stream.map(tokenize)

    # Use with PyTorch DataLoader
    batch = list(islice(tokenized_stream, 32))
    print(f"Batch of {len(batch)} tokenized examples")
    print(f"Keys: {list(batch[0].keys())}")
    return tokenized_stream


def interleave_streams():
    """Combine multiple streaming datasets."""
    # Named interleave_streams to avoid shadowing the library function
    from datasets import interleave_datasets

    en = load_dataset(
        "oscar-corpus/OSCAR-2301",
        "en",
        split="train",
        streaming=True,
    )
    fr = load_dataset(
        "oscar-corpus/OSCAR-2301",
        "fr",
        split="train",
        streaming=True,
    )

    # Interleave with equal probability
    combined = interleave_datasets([en, fr])

    # Or with custom weights (70% English, 30% French)
    weighted = interleave_datasets(
        [en, fr],
        probabilities=[0.7, 0.3],
        seed=42,
    )
    return weighted
Streaming vs Regular Loading:
┌─────────────────────────────────────────────────────────────────┐
│ REGULAR vs STREAMING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ REGULAR: load_dataset("c4", split="train") │
│ ┌─────────────────────────────────────────────────┐ │
│ │ 1. Download ALL data (305 GB) ⏳ hours │ │
│ │ 2. Convert to Arrow format ⏳ hours │ │
│ │ 3. Memory-map full dataset 💾 750GB │ │
│ │ 4. Random access to any example ✅ fast │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ STREAMING: load_dataset("c4", split="train", streaming=True) │
│ ┌─────────────────────────────────────────────────┐ │
│ │ 1. No download upfront ✅ instant│ │
│ │ 2. Fetch examples on-demand 💾 ~0 MB │ │
│ │ 3. Process sequentially ✅ works │ │
│ │ 4. No random access ❌ seq only│ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ Use streaming when: │
│ • Dataset is too large for disk │
│ • You only need one pass through the data │
│ • You want to start processing immediately │
│ │
└─────────────────────────────────────────────────────────────────┘
Step 5: Custom Dataset Creation
"""Create and publish custom datasets."""
from datasets import Dataset, DatasetDict, Features, Value, ClassLabel
from huggingface_hub import HfApi
def create_qa_dataset() -> DatasetDict:
"""Create a question-answering dataset from scratch."""
train_data = {
"question": [
"What is machine learning?",
"What is a neural network?",
"What is gradient descent?",
],
"answer": [
"A field of AI that learns patterns from data.",
"A computational model inspired by biological neurons.",
"An optimization algorithm that minimizes loss by following gradients.",
],
"category": ["fundamentals", "architecture", "optimization"],
"difficulty": ["easy", "medium", "medium"],
}
test_data = {
"question": ["What is overfitting?"],
"answer": ["When a model memorizes training data instead of learning patterns."],
"category": ["fundamentals"],
"difficulty": ["easy"],
}
features = Features({
"question": Value("string"),
"answer": Value("string"),
"category": ClassLabel(names=["fundamentals", "architecture", "optimization"]),
"difficulty": ClassLabel(names=["easy", "medium", "hard"]),
})
dataset = DatasetDict({
"train": Dataset.from_dict(train_data, features=features),
"test": Dataset.from_dict(test_data, features=features),
})
return dataset
def push_dataset_to_hub(
    dataset: DatasetDict,
    repo_id: str,
    private: bool = False,
):
    """
    Push a dataset to HuggingFace Hub.

    Args:
        dataset: The DatasetDict to push
        repo_id: Hub repository ID (e.g., "username/my-dataset")
        private: Whether the repo should be private
    """
    dataset.push_to_hub(
        repo_id,
        private=private,
    )
    print(f"Dataset pushed to: https://huggingface.co/datasets/{repo_id}")


def save_and_load_locally(dataset: DatasetDict, path: str = "data/my-dataset"):
    """Save dataset to disk and reload it."""
    # Save (Arrow format — very fast)
    dataset.save_to_disk(path)
    print(f"Saved to {path}")

    # Reload
    from datasets import load_from_disk
    reloaded = load_from_disk(path)
    print(f"Reloaded: {reloaded}")
    return reloaded
Step 6: Preprocessing Pipeline
"""Complete NLP preprocessing pipeline with datasets."""
from datasets import load_dataset
from transformers import AutoTokenizer
def build_classification_pipeline(
dataset_name: str = "imdb",
model_name: str = "bert-base-uncased",
max_length: int = 512,
):
"""
Build a complete preprocessing pipeline for text classification.
Steps: load → clean → tokenize → format → split
"""
# Load
dataset = load_dataset(dataset_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Clean: remove HTML tags, normalize whitespace
import re
def clean_text(example):
text = example["text"]
text = re.sub(r"<[^>]+>", "", text) # Remove HTML
text = re.sub(r"\s+", " ", text).strip() # Normalize whitespace
example["text"] = text
return example
dataset = dataset.map(clean_text)
# Tokenize (batched for speed)
def tokenize(batch):
return tokenizer(
batch["text"],
truncation=True,
max_length=max_length,
padding="max_length",
)
dataset = dataset.map(
tokenize,
batched=True,
batch_size=1000,
remove_columns=["text"], # Remove raw text, keep only token IDs
)
# Set format for PyTorch
dataset.set_format("torch")
print(f"Final columns: {dataset['train'].column_names}")
print(f"Example keys: {list(dataset['train'][0].keys())}")
print(f"input_ids shape: {dataset['train'][0]['input_ids'].shape}")
return dataset
def build_language_modeling_pipeline(
dataset_name: str = "wikitext",
dataset_config: str = "wikitext-2-raw-v1",
model_name: str = "gpt2",
block_size: int = 128,
):
"""
Build a preprocessing pipeline for causal language modeling.
Concatenates all texts and splits into fixed-length blocks.
"""
dataset = load_dataset(dataset_name, dataset_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# Tokenize all texts
def tokenize(batch):
return tokenizer(batch["text"])
tokenized = dataset.map(
tokenize,
batched=True,
remove_columns=dataset["train"].column_names,
)
# Group texts into fixed-length blocks
def group_texts(batch):
# Concatenate all input_ids
concatenated = {k: sum(batch[k], []) for k in batch.keys()}
total_length = len(concatenated["input_ids"])
# Truncate to multiple of block_size
total_length = (total_length // block_size) * block_size
# Split into chunks
result = {
k: [v[i : i + block_size] for i in range(0, total_length, block_size)]
for k, v in concatenated.items()
}
# Labels = input_ids (shifted by 1 during training)
result["labels"] = result["input_ids"].copy()
return result
lm_dataset = tokenized.map(
group_texts,
batched=True,
)
print(f"Training examples: {len(lm_dataset['train'])}")
print(f"Each example: {block_size} tokens")
return lm_datasetPreprocessing Pipeline Flow:
┌─────────────────────────────────────────────────────────────────┐
│ CLASSIFICATION PREPROCESSING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Raw Data │
│ {"text": "<br/>Great movie!<br/>", "label": 1} │
│ │ │
│ ▼ clean_text() │
│ {"text": "Great movie!", "label": 1} │
│ │ │
│ ▼ tokenize() batched=True │
│ {"input_ids": [101, 2307, 3185, 999, 102, 0, 0, ...], │
│ "attention_mask": [1, 1, 1, 1, 1, 0, 0, ...], │
│ "label": 1} │
│ │ │
│ ▼ set_format("torch") │
│ {input_ids: tensor([101, 2307, ...]), │
│ attention_mask: tensor([1, 1, ...]), │
│ label: tensor(1)} │
│ │ │
│ ▼ Ready for DataLoader and training! │
│ │
└─────────────────────────────────────────────────────────────────┘
Running the Project
# Install dependencies
pip install -r requirements.txt

# Load and explore a dataset
python -c "
from src.loading import load_from_hub
ds = load_from_hub()
print(ds['train'][0])
"

# Run full preprocessing pipeline
python -c "
from src.preprocessing import build_classification_pipeline
ds = build_classification_pipeline()
print(ds['train'][0])
"

# Stream a large dataset
python -c "
from src.streaming import stream_large_dataset
stream_large_dataset()
"
Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Apache Arrow | Columnar memory format used by datasets | Enables zero-copy reads and memory-mapped access |
| load_dataset | Universal data loader for Hub, local, and remote files | One function for any data source |
| Streaming | Iterate through data without downloading it all | Handle datasets larger than disk |
| map(batched=True) | Process examples in batches | 10-100x faster than per-example |
| num_proc | Parallelize map across CPU cores | Utilize all available compute |
| set_format("torch") | Return PyTorch tensors from __getitem__ | Direct integration with DataLoader |
| push_to_hub | Publish datasets to HuggingFace Hub | Share data with the community |
| Features/ClassLabel | Typed schema for dataset columns | Enable automatic label encoding |
Next Steps
- Text Embeddings & Semantic Search — Use datasets to build a search index
- Fine-Tuning with PEFT — Use preprocessed datasets for fine-tuning