Datasets Mastery
Load, stream, transform, and publish datasets with HuggingFace datasets library
TL;DR
The datasets library provides memory-efficient loading, streaming, and transformation of any dataset — from Hub-hosted collections to custom CSVs. Learn load_dataset, map/filter/select operations, streaming for large datasets, and pushing your own datasets to the Hub.
Master the HuggingFace datasets library for loading, transforming, and publishing datasets at any scale.
What You'll Learn
- Loading datasets from the Hub, local files, and custom scripts
- Streaming mode for datasets that don't fit in memory
- map, filter, select, and sort transformations
- Building preprocessing pipelines with batched operations
- Creating and publishing custom datasets to the Hub
- Memory-efficient data handling with Apache Arrow
Tech Stack
| Component | Technology |
|---|---|
| Data Loading | datasets |
| Hub Integration | huggingface_hub |
| Tokenization | transformers |
| Python | 3.10+ |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ DATASETS WORKFLOW │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ DATA SOURCES PROCESSING │
│ ┌────────────────┐ ┌────────────────────────────────┐ │
│ │ HuggingFace Hub│──┐ │ Apache Arrow In-Memory Format │ │
│ │ (200K+ datasets)│ │ │ │ │
│ └────────────────┘ │ load_dataset() │ ┌──────┐ ┌──────┐ ┌────────┐ │ │
│ ┌────────────────┐ ├────────────────▶│ │ map │►│filter│►│ select │ │ │
│ │ Local Files │──┤ │ └──────┘ └──────┘ └────────┘ │ │
│ │ (CSV/JSON/ │ │ │ │ │
│ │ Parquet/Text) │ │ │ Features: │ │
│ └────────────────┘ │ │ • Zero-copy reads │ │
│ ┌────────────────┐ │ │ • Memory-mapped files │ │
│ │ Custom Script │──┘ │ • Batched processing │ │
│ └────────────────┘ └───────────────┬────────────────┘ │
│ │ │
│ STREAMING (for large datasets) ▼ │
│ ┌────────────────────────────────┐ ┌────────────────────────────────┐ │
│ │ load_dataset(..., streaming=True)│ │ Push to Hub / Save to Disk │ │
│ │ Returns IterableDataset │ │ push_to_hub() / save_to_disk()│ │
│ │ • No download required │ └────────────────────────────────┘ │
│ │ • Process one example at a time │ │
│ └────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Project Structure
datasets-mastery/
├── src/
│ ├── __init__.py
│ ├── loading.py # Dataset loading patterns
│ ├── transforms.py # Map, filter, and preprocessing
│ ├── streaming.py # Streaming for large datasets
│ ├── custom_dataset.py # Create and publish custom datasets
│ └── preprocessing.py # NLP preprocessing pipelines
├── data/
│ ├── sample.csv
│ └── sample.jsonl
├── examples/
│ └── full_pipeline.py
├── requirements.txt
└── README.md
Implementation
Step 1: Dependencies
datasets>=2.19.0
huggingface_hub>=0.23.0
transformers>=4.40.0
pandas>=2.0.0
Step 2: Loading Datasets
"""Dataset loading patterns for different sources."""
from datasets import (
load_dataset,
load_from_disk,
Dataset,
DatasetDict,
Features,
Value,
ClassLabel,
)
def load_from_hub():
"""Load popular datasets from HuggingFace Hub."""
# Load full dataset with all splits
imdb = load_dataset("imdb")
print(f"IMDB splits: {list(imdb.keys())}")
print(f"Train size: {len(imdb['train']):,}")
print(f"Features: {imdb['train'].features}")
# Load a specific split
train = load_dataset("imdb", split="train")
print(f"\nSingle split shape: {train.shape}")
# Load a subset (first 1000 examples)
subset = load_dataset("imdb", split="train[:1000]")
print(f"Subset size: {len(subset)}")
# Load specific columns only
texts_only = load_dataset(
"imdb",
split="train[:100]",
).select_columns(["text"])
print(f"Columns: {texts_only.column_names}")
return imdb
def load_from_local_files():
"""Load datasets from local CSV, JSON, and text files."""
# From CSV
csv_dataset = load_dataset("csv", data_files="data/sample.csv")
print(f"CSV dataset: {csv_dataset}")
# From JSON Lines
jsonl_dataset = load_dataset("json", data_files="data/sample.jsonl")
print(f"JSONL dataset: {jsonl_dataset}")
# From multiple files with train/test split
split_dataset = load_dataset(
"csv",
data_files={
"train": "data/train.csv",
"test": "data/test.csv",
},
)
# From text files (one example per line)
text_dataset = load_dataset(
"text",
data_files="data/corpus.txt",
)
return csv_dataset
def create_from_dict():
    """Create a dataset from Python dictionaries."""
    # Simple creation from dict
    dataset = Dataset.from_dict({
        "text": [
            "Great product!",
            "Terrible service.",
            "Average quality.",
        ],
        "label": [1, 0, 1],
    })
    print(f"From dict: {dataset}")

    # With explicit features/schema
    features = Features({
        "text": Value("string"),
        "label": ClassLabel(names=["negative", "positive"]),
        "score": Value("float32"),
    })
    typed_dataset = Dataset.from_dict(
        {
            "text": ["Hello", "World"],
            "label": [0, 1],
            "score": [0.3, 0.9],
        },
        features=features,
    )
    print(f"With features: {typed_dataset.features}")
    return dataset


def create_from_pandas():
    """Create a dataset from a pandas DataFrame."""
    import pandas as pd

    df = pd.DataFrame({
        "question": ["What is Python?", "What is ML?"],
        "answer": ["A programming language", "Machine learning"],
        "category": ["programming", "ai"],
    })
    dataset = Dataset.from_pandas(df)
    print(f"From pandas: {dataset}")
    return dataset
Understanding the Arrow Backend:
┌─────────────────────────────────────────────────────────────────┐
│ WHY APACHE ARROW? │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Traditional Python (pandas): │
│ ┌─────────────────────────────┐ │
│ │ DataFrame in RAM │ All data loaded into memory │
│ │ 10GB dataset = 10GB+ RAM │ Copies on every operation │
│ └─────────────────────────────┘ │
│ │
│ HuggingFace datasets (Arrow): │
│ ┌─────────────────────────────┐ │
│ │ Memory-mapped Arrow file │ Data stays on disk │
│ │ 10GB dataset ≈ 0 RAM usage │ OS pages data in as needed │
│ │ Zero-copy column access │ No deserialization overhead │
│ └─────────────────────────────┘ │
│ │
│ Practical impact: │
│ • Load The Pile (800GB) on a laptop with 16GB RAM │
│ • Instant random access to any example │
│ • Column operations without loading unused columns │
│ • Cached transformations (map results saved to disk) │
│ │
└─────────────────────────────────────────────────────────────────┘
Step 3: Transformations
"""Dataset transformations: map, filter, select, sort."""
from datasets import Dataset
def map_examples(dataset: Dataset) -> Dataset:
"""Apply transformations with map()."""
# Single example map — add a new column
def add_word_count(example):
example["word_count"] = len(example["text"].split())
return example
dataset = dataset.map(add_word_count)
print(f"Columns after map: {dataset.column_names}")
return dataset
def batched_map(dataset: Dataset) -> Dataset:
"""
Batched map for higher throughput.
Processing examples in batches is 10-100x faster than
one-at-a-time, especially for tokenization.
"""
def tokenize_batch(batch):
"""Process a batch of examples at once."""
# batch["text"] is a list of strings
# Return a dict of lists (same length as input)
batch["text_lower"] = [t.lower() for t in batch["text"]]
batch["char_count"] = [len(t) for t in batch["text"]]
return batch
dataset = dataset.map(
tokenize_batch,
batched=True,
batch_size=1000,
num_proc=4, # Parallelize across 4 CPU cores
)
return dataset
def filter_examples(dataset: Dataset) -> Dataset:
"""Filter dataset based on conditions."""
# Keep only examples with more than 10 words
filtered = dataset.filter(
lambda x: len(x["text"].split()) > 10
)
print(f"Before filter: {len(dataset)}, After: {len(filtered)}")
# Batched filter (faster)
filtered_batch = dataset.filter(
lambda batch: [len(t.split()) > 10 for t in batch["text"]],
batched=True,
)
return filtered
def select_and_sort(dataset: Dataset) -> Dataset:
"""Select specific examples and sort."""
# Select by indices
subset = dataset.select(range(100))
print(f"Selected 100: {len(subset)}")
# Random shuffle
shuffled = dataset.shuffle(seed=42)
# Sort by a column
if "word_count" in dataset.column_names:
sorted_ds = dataset.sort("word_count", reverse=True)
print(f"Longest text: {sorted_ds[0]['word_count']} words")
# Train/test split
splits = dataset.train_test_split(test_size=0.2, seed=42)
print(f"Train: {len(splits['train'])}, Test: {len(splits['test'])}")
return splits
def rename_and_remove(dataset: Dataset) -> Dataset:
"""Rename and remove columns."""
# Rename columns
dataset = dataset.rename_column("text", "input_text")
# Remove columns
dataset = dataset.remove_columns(["word_count"])
# Cast column types
dataset = dataset.cast_column("label", "int32")
return datasetMap Operations Compared:
| Mode | Speed | Memory | Use Case |
|---|---|---|---|
| map(fn) | Slow | Low per-example | Simple per-example transforms |
| map(fn, batched=True) | Fast | Moderate | Tokenization, batch processing |
| map(fn, batched=True, num_proc=4) | Fastest | Higher | CPU-bound transforms |
Step 4: Streaming for Large Datasets
"""Streaming datasets for data too large to fit in memory."""
from datasets import load_dataset
from itertools import islice
def stream_large_dataset():
"""
Load a dataset in streaming mode.
Streaming downloads and processes data on-the-fly,
never loading the full dataset into memory.
"""
# Stream C4 (305GB compressed, ~750GB uncompressed)
stream = load_dataset(
"allenai/c4",
"en",
split="train",
streaming=True,
)
# Peek at first 5 examples
for i, example in enumerate(stream):
print(f" Example {i}: {example['text'][:80]}...")
if i >= 4:
break
def stream_with_transforms():
"""Apply transforms to a streaming dataset."""
stream = load_dataset(
"imdb",
split="train",
streaming=True,
)
# Map transforms work on streams
processed = stream.map(
lambda x: {"text_length": len(x["text"])}
)
# Filter works on streams
long_reviews = processed.filter(
lambda x: x["text_length"] > 1000
)
# Take first N examples
samples = list(islice(long_reviews, 10))
print(f"Got {len(samples)} long reviews")
return samples
def stream_and_tokenize():
    """Tokenize a streaming dataset for training."""
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    stream = load_dataset(
        "wikitext",
        "wikitext-2-raw-v1",
        split="train",
        streaming=True,
    )

    def tokenize(example):
        return tokenizer(
            example["text"],
            truncation=True,
            max_length=512,
            padding="max_length",
        )

    tokenized_stream = stream.map(tokenize)

    # Use with PyTorch DataLoader
    batch = list(islice(tokenized_stream, 32))
    print(f"Batch of {len(batch)} tokenized examples")
    print(f"Keys: {list(batch[0].keys())}")
    return tokenized_stream


def interleave_streams():
    """Combine multiple streaming datasets."""
    # Named interleave_streams to avoid shadowing the library function
    from datasets import interleave_datasets

    en = load_dataset(
        "oscar-corpus/OSCAR-2301",
        "en",
        split="train",
        streaming=True,
    )
    fr = load_dataset(
        "oscar-corpus/OSCAR-2301",
        "fr",
        split="train",
        streaming=True,
    )

    # Interleave with equal probability
    combined = interleave_datasets([en, fr])

    # Or with custom weights (70% English, 30% French)
    weighted = interleave_datasets(
        [en, fr],
        probabilities=[0.7, 0.3],
        seed=42,
    )
    return weighted
Streaming vs Regular Loading:
┌─────────────────────────────────────────────────────────────────┐
│ REGULAR vs STREAMING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ REGULAR: load_dataset("c4", split="train") │
│ ┌─────────────────────────────────────────────────┐ │
│ │ 1. Download ALL data (305 GB) ⏳ hours │ │
│ │ 2. Convert to Arrow format ⏳ hours │ │
│ │ 3. Memory-map full dataset 💾 750GB │ │
│ │ 4. Random access to any example ✅ fast │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ STREAMING: load_dataset("c4", split="train", streaming=True) │
│ ┌─────────────────────────────────────────────────┐ │
│ │ 1. No download upfront ✅ instant│ │
│ │ 2. Fetch examples on-demand 💾 ~0 MB │ │
│ │ 3. Process sequentially ✅ works │ │
│ │ 4. No random access ❌ seq only│ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ Use streaming when: │
│ • Dataset is too large for disk │
│ • You only need one pass through the data │
│ • You want to start processing immediately │
│ │
└─────────────────────────────────────────────────────────────────┘
Step 5: Custom Dataset Creation
"""Create and publish custom datasets."""
from datasets import Dataset, DatasetDict, Features, Value, ClassLabel
from huggingface_hub import HfApi
def create_qa_dataset() -> DatasetDict:
"""Create a question-answering dataset from scratch."""
train_data = {
"question": [
"What is machine learning?",
"What is a neural network?",
"What is gradient descent?",
],
"answer": [
"A field of AI that learns patterns from data.",
"A computational model inspired by biological neurons.",
"An optimization algorithm that minimizes loss by following gradients.",
],
"category": ["fundamentals", "architecture", "optimization"],
"difficulty": ["easy", "medium", "medium"],
}
test_data = {
"question": ["What is overfitting?"],
"answer": ["When a model memorizes training data instead of learning patterns."],
"category": ["fundamentals"],
"difficulty": ["easy"],
}
features = Features({
"question": Value("string"),
"answer": Value("string"),
"category": ClassLabel(names=["fundamentals", "architecture", "optimization"]),
"difficulty": ClassLabel(names=["easy", "medium", "hard"]),
})
dataset = DatasetDict({
"train": Dataset.from_dict(train_data, features=features),
"test": Dataset.from_dict(test_data, features=features),
})
return dataset
def push_dataset_to_hub(
    dataset: DatasetDict,
    repo_id: str,
    private: bool = False,
):
    """
    Push a dataset to HuggingFace Hub.

    Args:
        dataset: The DatasetDict to push
        repo_id: Hub repository ID (e.g., "username/my-dataset")
        private: Whether the repo should be private
    """
    dataset.push_to_hub(
        repo_id,
        private=private,
    )
    print(f"Dataset pushed to: https://huggingface.co/datasets/{repo_id}")


def save_and_load_locally(dataset: DatasetDict, path: str = "data/my-dataset"):
    """Save dataset to disk and reload it."""
    # Save (Arrow format — very fast)
    dataset.save_to_disk(path)
    print(f"Saved to {path}")

    # Reload
    from datasets import load_from_disk
    reloaded = load_from_disk(path)
    print(f"Reloaded: {reloaded}")
    return reloaded
Step 6: Preprocessing Pipeline
"""Complete NLP preprocessing pipeline with datasets."""
from datasets import load_dataset
from transformers import AutoTokenizer
def build_classification_pipeline(
dataset_name: str = "imdb",
model_name: str = "bert-base-uncased",
max_length: int = 512,
):
"""
Build a complete preprocessing pipeline for text classification.
Steps: load → clean → tokenize → format → split
"""
# Load
dataset = load_dataset(dataset_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Clean: remove HTML tags, normalize whitespace
import re
def clean_text(example):
text = example["text"]
text = re.sub(r"<[^>]+>", "", text) # Remove HTML
text = re.sub(r"\s+", " ", text).strip() # Normalize whitespace
example["text"] = text
return example
dataset = dataset.map(clean_text)
# Tokenize (batched for speed)
def tokenize(batch):
return tokenizer(
batch["text"],
truncation=True,
max_length=max_length,
padding="max_length",
)
dataset = dataset.map(
tokenize,
batched=True,
batch_size=1000,
remove_columns=["text"], # Remove raw text, keep only token IDs
)
# Set format for PyTorch
dataset.set_format("torch")
print(f"Final columns: {dataset['train'].column_names}")
print(f"Example keys: {list(dataset['train'][0].keys())}")
print(f"input_ids shape: {dataset['train'][0]['input_ids'].shape}")
return dataset
def build_language_modeling_pipeline(
dataset_name: str = "wikitext",
dataset_config: str = "wikitext-2-raw-v1",
model_name: str = "gpt2",
block_size: int = 128,
):
"""
Build a preprocessing pipeline for causal language modeling.
Concatenates all texts and splits into fixed-length blocks.
"""
dataset = load_dataset(dataset_name, dataset_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# Tokenize all texts
def tokenize(batch):
return tokenizer(batch["text"])
tokenized = dataset.map(
tokenize,
batched=True,
remove_columns=dataset["train"].column_names,
)
# Group texts into fixed-length blocks
def group_texts(batch):
# Concatenate all input_ids
concatenated = {k: sum(batch[k], []) for k in batch.keys()}
total_length = len(concatenated["input_ids"])
# Truncate to multiple of block_size
total_length = (total_length // block_size) * block_size
# Split into chunks
result = {
k: [v[i : i + block_size] for i in range(0, total_length, block_size)]
for k, v in concatenated.items()
}
# Labels = input_ids (shifted by 1 during training)
result["labels"] = result["input_ids"].copy()
return result
lm_dataset = tokenized.map(
group_texts,
batched=True,
)
print(f"Training examples: {len(lm_dataset['train'])}")
print(f"Each example: {block_size} tokens")
return lm_datasetPreprocessing Pipeline Flow:
┌─────────────────────────────────────────────────────────────────┐
│ CLASSIFICATION PREPROCESSING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Raw Data │
│ {"text": "<br/>Great movie!<br/>", "label": 1} │
│ │ │
│ ▼ clean_text() │
│ {"text": "Great movie!", "label": 1} │
│ │ │
│ ▼ tokenize() batched=True │
│ {"input_ids": [101, 2307, 3185, 999, 102, 0, 0, ...], │
│ "attention_mask": [1, 1, 1, 1, 1, 0, 0, ...], │
│ "label": 1} │
│ │ │
│ ▼ set_format("torch") │
│ {input_ids: tensor([101, 2307, ...]), │
│ attention_mask: tensor([1, 1, ...]), │
│ label: tensor(1)} │
│ │ │
│ ▼ Ready for DataLoader and training! │
│ │
└─────────────────────────────────────────────────────────────────┘
Running the Project
# Install dependencies
pip install -r requirements.txt

# Load and explore a dataset
python -c "
from src.loading import load_from_hub
ds = load_from_hub()
print(ds['train'][0])
"

# Run full preprocessing pipeline
python -c "
from src.preprocessing import build_classification_pipeline
ds = build_classification_pipeline()
print(ds['train'][0])
"

# Stream a large dataset
python -c "
from src.streaming import stream_large_dataset
stream_large_dataset()
"
Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Apache Arrow | Columnar memory format used by datasets | Enables zero-copy reads and memory-mapped access |
| load_dataset | Universal data loader for Hub, local, and remote files | One function for any data source |
| Streaming | Iterate through data without downloading it all | Handle datasets larger than disk |
| map(batched=True) | Process examples in batches | 10-100x faster than per-example |
| num_proc | Parallelize map across CPU cores | Utilize all available compute |
| set_format("torch") | Return PyTorch tensors from __getitem__ | Direct integration with DataLoader |
| push_to_hub | Publish datasets to HuggingFace Hub | Share data with the community |
| Features/ClassLabel | Typed schema for dataset columns | Enable automatic label encoding |
Next Steps
- Text Embeddings & Semantic Search — Use datasets to build a search index
- Fine-Tuning with PEFT — Use preprocessed datasets for fine-tuning