HuggingFace Ecosystem · Intermediate
Image Generation with Diffusers
Generate, edit, and control images with Stable Diffusion and the diffusers library
TL;DR
Use the HuggingFace diffusers library to run Stable Diffusion for text-to-image generation, image-to-image editing, inpainting, and controlled generation with ControlNet. Learn how noise schedulers work and how to apply LoRA style adapters.
Build a complete image generation application using the diffusers library, covering text-to-image, img2img, inpainting, ControlNet, and LoRA style adapters.
What You'll Learn
- Stable Diffusion architecture and diffusion process
- Text-to-image generation with prompt engineering
- Image-to-image editing and inpainting
- Noise schedulers (DDPM, DDIM, Euler, DPM++)
- ControlNet for structure-guided generation
- LoRA style adapters for customization
- Memory optimization with accelerate
Tech Stack
| Component | Technology |
|---|---|
| Diffusion | diffusers |
| Models | transformers |
| Optimization | accelerate |
| Image Processing | Pillow |
| Python | 3.10+ |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ STABLE DIFFUSION ARCHITECTURE │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ TEXT-TO-IMAGE PIPELINE │
│ │
│ "A cat on the moon" ──► CLIP Text Encoder ──► Text Embeddings │
│ │ │
│ ▼ │
│ Random Noise ──► ┌────────────────────────────────────────┐ │
│ (latent space) │ U-Net Denoiser │ │
│ │ │ │
│ │ Step 1: Predict noise at timestep T │ │
│ │ Step 2: Remove predicted noise │ │
│ │ Step 3: Repeat for T-1, T-2, ..., 0 │ │
│ │ │ │
│ │ (conditioned on text embeddings) │ │
│ └───────────────────────────┬────────────┘ │
│ │ │
│ ▼ │
│ VAE Decoder ──► Final Image (512×512) │
│ (latent → pixel) │
│ │
│ Why latent space? 512×512×3 = 786K values │
│ 64×64×4 = 16K values (48x smaller!) │
│ │
└──────────────────────────────────────────────────────────────────────────────┘

Project Structure
image-generation/
├── src/
│ ├── __init__.py
│ ├── text_to_image.py # Basic text-to-image generation
│ ├── img2img.py # Image-to-image editing
│ ├── inpainting.py # Selective region editing
│ ├── controlnet.py # Structure-guided generation
│ ├── schedulers.py # Noise scheduler comparison
│ ├── lora_styles.py # LoRA style adapters
│ └── optimization.py # Memory and speed optimization
├── api/
│ └── main.py # FastAPI application
├── requirements.txt
└── README.md

Implementation
Step 1: Dependencies
diffusers>=0.28.0
transformers>=4.40.0
accelerate>=0.30.0
torch>=2.0.0
Pillow>=10.0.0
fastapi>=0.111.0
uvicorn>=0.30.0
safetensors>=0.4.0

Step 2: Text-to-Image Generation
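Before the pipeline code, a quick check of the "48x smaller" claim from the architecture diagram — plain arithmetic, no GPU required:

```python
# SD 1.x denoises a 4x64x64 latent instead of a 3x512x512 RGB image.
pixel_values = 3 * 512 * 512   # values in pixel space
latent_values = 4 * 64 * 64    # values in latent space
ratio = pixel_values / latent_values

print(pixel_values, latent_values, ratio)  # 786432 16384 48.0
```

Every denoising step runs the U-Net over 48x fewer values, which is why latent diffusion is practical on consumer GPUs.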
"""Text-to-image generation with Stable Diffusion."""
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionXLPipeline
from PIL import Image
def generate_image(
prompt: str,
negative_prompt: str = "low quality, blurry, distorted",
model_id: str = "stabilityai/stable-diffusion-xl-base-1.0",
num_inference_steps: int = 30,
guidance_scale: float = 7.5,
width: int = 1024,
height: int = 1024,
seed: int | None = None,
) -> Image.Image:
"""
Generate an image from a text prompt.
Args:
prompt: What to generate
negative_prompt: What to avoid (improves quality significantly)
model_id: HuggingFace model ID for Stable Diffusion
num_inference_steps: Denoising steps (more = better quality, slower)
guidance_scale: How closely to follow the prompt (7-9 is typical)
width: Output width (must be multiple of 8)
height: Output height (must be multiple of 8)
seed: Random seed for reproducibility
"""
# Select pipeline class based on model
if "xl" in model_id.lower():
pipe = StableDiffusionXLPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
variant="fp16",
use_safetensors=True,
)
else:
pipe = StableDiffusionPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
use_safetensors=True,
)
pipe = pipe.to("cuda")
# Enable memory-efficient attention
pipe.enable_xformers_memory_efficient_attention()
# Set seed for reproducibility
generator = None
if seed is not None:
generator = torch.Generator(device="cuda").manual_seed(seed)
# Generate
result = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
width=width,
height=height,
generator=generator,
)
return result.images[0]
def generate_batch(
prompts: list[str],
negative_prompt: str = "low quality, blurry",
model_id: str = "runwayml/stable-diffusion-v1-5",
num_inference_steps: int = 30,
) -> list[Image.Image]:
"""Generate multiple images at once (uses more VRAM)."""
pipe = StableDiffusionPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
).to("cuda")
result = pipe(
prompt=prompts,
negative_prompt=[negative_prompt] * len(prompts),
num_inference_steps=num_inference_steps,
)
return result.imagesGuidance Scale Explained:
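The box below explains classifier-free guidance; its one-line update rule can be sanity-checked with scalar stand-ins for the noise tensors:

```python
def cfg(uncond: float, cond: float, scale: float) -> float:
    # predicted_noise = unconditional + scale * (conditional - unconditional)
    return uncond + scale * (cond - uncond)

# scale = 1.0 returns the plain conditional prediction (no guidance):
print(cfg(0.0, 1.0, 1.0))  # 1.0
# Higher scales push the prediction further along the text-conditioned direction:
print(cfg(0.0, 1.0, 7.5))  # 7.5
```

The real pipeline applies the same formula element-wise over the full latent noise tensor at every denoising step.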
┌─────────────────────────────────────────────────────────────────┐
│ CLASSIFIER-FREE GUIDANCE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ guidance_scale controls prompt adherence vs creativity: │
│ │
│   Scale = 1.0:  No guidance; plain conditional prediction       │
│ Scale = 3.0: Loosely follow prompt (creative, diverse) │
│ Scale = 7.5: Balanced (default, good quality) │
│ Scale = 12.0: Strict prompt following (may oversaturate) │
│ Scale = 20.0: Extreme adherence (artifacts, unrealistic) │
│ │
│ How it works: │
│ predicted_noise = unconditional + scale × (conditional - unconditional)
│ │
│ The U-Net runs TWICE per step: │
│ 1. Without text conditioning (unconditional) │
│ 2. With text conditioning (conditional) │
│ Then amplifies the difference by guidance_scale. │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 3: Noise Schedulers
"""Compare noise schedulers for diffusion models."""
from diffusers import (
DDPMScheduler,
DDIMScheduler,
EulerDiscreteScheduler,
EulerAncestralDiscreteScheduler,
DPMSolverMultistepScheduler,
StableDiffusionPipeline,
)
import torch
SCHEDULERS = {
"ddpm": DDPMScheduler,
"ddim": DDIMScheduler,
"euler": EulerDiscreteScheduler,
"euler_a": EulerAncestralDiscreteScheduler,
"dpm++": DPMSolverMultistepScheduler,
}
def compare_schedulers(
prompt: str,
model_id: str = "runwayml/stable-diffusion-v1-5",
num_steps: int = 30,
seed: int = 42,
) -> dict:
"""
Generate the same prompt with different schedulers.
Returns dict mapping scheduler name to generated image.
"""
pipe = StableDiffusionPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
).to("cuda")
results = {}
for name, scheduler_class in SCHEDULERS.items():
pipe.scheduler = scheduler_class.from_config(
pipe.scheduler.config
)
generator = torch.Generator(device="cuda").manual_seed(seed)
image = pipe(
prompt=prompt,
num_inference_steps=num_steps,
generator=generator,
).images[0]
results[name] = image
print(f" {name}: done")
return resultsScheduler Comparison:
| Scheduler | Steps Needed | Speed | Quality | Deterministic |
|---|---|---|---|---|
| DDPM | 1000 | Slowest | Baseline | No (stochastic) |
| DDIM | 20-50 | Fast | Good | Yes |
| Euler | 20-30 | Fast | Good | Yes |
| Euler Ancestral | 20-30 | Fast | Creative | No (stochastic) |
| DPM++ 2M | 15-25 | Fastest | Best | Yes |
Step 4: Image-to-Image and Inpainting
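A detail worth internalizing before the code: strength does not blend pixels. It sets how far into the noise schedule the input image is pushed, which in turn decides how many denoising steps actually run. A pure-Python sketch of the convention the diffusers img2img pipeline uses (the real pipeline clamps the same way):

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    # The input image is noised up to step int(steps * strength);
    # only those final denoising steps are actually executed.
    return min(int(num_inference_steps * strength), num_inference_steps)

print(effective_steps(30, 0.5))   # 15 -> subtle edit, input clearly preserved
print(effective_steps(30, 0.75))  # 22 -> major change (the default)
print(effective_steps(30, 1.0))   # 30 -> full redraw from pure noise
```

This is why low strength values are both faster and more faithful to the input: they simply skip most of the denoising trajectory.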
"""Image-to-image editing with Stable Diffusion."""
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image
def edit_image(
image: Image.Image,
prompt: str,
strength: float = 0.75,
model_id: str = "runwayml/stable-diffusion-v1-5",
num_inference_steps: int = 30,
guidance_scale: float = 7.5,
) -> Image.Image:
"""
Edit an image guided by a text prompt.
Args:
image: Source image to edit
prompt: Text description of desired result
strength: How much to change (0.0 = no change, 1.0 = complete redraw)
Typical range: 0.3 (subtle edit) to 0.8 (major change)
model_id: Model to use
num_inference_steps: Denoising steps
guidance_scale: Prompt adherence strength
"""
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
).to("cuda")
# Resize image to model's expected dimensions
image = image.resize((512, 512))
result = pipe(
prompt=prompt,
image=image,
strength=strength,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
)
return result.images[0]"""Inpainting: edit specific regions of an image."""
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image
def inpaint(
image: Image.Image,
mask: Image.Image,
prompt: str,
model_id: str = "runwayml/stable-diffusion-inpainting",
num_inference_steps: int = 30,
) -> Image.Image:
"""
Edit only the masked region of an image.
Args:
image: Original image
mask: Binary mask (white = edit, black = keep)
prompt: What to fill in the masked region
model_id: Inpainting model
num_inference_steps: Denoising steps
"""
pipe = StableDiffusionInpaintPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
).to("cuda")
image = image.resize((512, 512))
mask = mask.resize((512, 512))
result = pipe(
prompt=prompt,
image=image,
mask_image=mask,
num_inference_steps=num_inference_steps,
)
return result.images[0]Diffusion Pipeline Variants:
┌─────────────────────────────────────────────────────────────────┐
│ PIPELINE TYPES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ TEXT-TO-IMAGE │
│ "A sunset over mountains" ──► [noise] ──► [denoise] ──► image │
│ Start from pure noise │
│ │
│ IMG2IMG │
│ image + "oil painting style" ──► [add noise] ──► [denoise] ──► │
│ Start from noised version of input (strength controls noise) │
│ │
│ INPAINTING │
│ image + mask + "a red car" ──► [denoise masked region] ──► │
│ Only edit the white-masked area, keep the rest intact │
│ │
│ CONTROLNET │
│ image + control_image + "detailed photo" ──► [denoise] ──► │
│ Follow the structure of the control image (edges, pose, depth) │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 5: ControlNet
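ControlNet trains a copy of the U-Net encoder on the control image and adds its outputs into the frozen U-Net's skip connections; controlnet_conditioning_scale simply scales those residuals before the add. A toy sketch of that injection (plain Python with made-up numbers, not real tensors):

```python
def inject_control(unet_hidden: list[float],
                   control_residual: list[float],
                   conditioning_scale: float) -> list[float]:
    # Frozen U-Net activations plus scaled ControlNet residuals.
    return [h + conditioning_scale * r
            for h, r in zip(unet_hidden, control_residual)]

hidden = [1.0, -2.0, 0.5]
residual = [0.5, 1.0, -0.5]
print(inject_control(hidden, residual, 0.0))  # [1.0, -2.0, 0.5] -> control ignored
print(inject_control(hidden, residual, 1.0))  # [1.5, -1.0, 0.0] -> full strength
```

Because the base model's weights are untouched, the same ControlNet can be paired with any compatible SD 1.5 checkpoint.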
"""ControlNet for structure-guided image generation."""
import torch
from diffusers import (
StableDiffusionControlNetPipeline,
ControlNetModel,
UniPCMultistepScheduler,
)
from PIL import Image
import numpy as np
def generate_with_controlnet(
prompt: str,
control_image: Image.Image,
controlnet_model: str = "lllyasviel/sd-controlnet-canny",
sd_model: str = "runwayml/stable-diffusion-v1-5",
num_inference_steps: int = 30,
controlnet_conditioning_scale: float = 1.0,
) -> Image.Image:
"""
Generate an image following the structure of a control image.
Args:
prompt: Text description
control_image: Preprocessed control image (edges, pose, depth)
controlnet_model: ControlNet model for the control type
sd_model: Base Stable Diffusion model
num_inference_steps: Denoising steps
controlnet_conditioning_scale: How strongly to follow the control
(0.0 = ignore, 1.0 = strict)
"""
controlnet = ControlNetModel.from_pretrained(
controlnet_model,
torch_dtype=torch.float16,
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
sd_model,
controlnet=controlnet,
torch_dtype=torch.float16,
).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(
pipe.scheduler.config
)
result = pipe(
prompt=prompt,
image=control_image,
num_inference_steps=num_inference_steps,
controlnet_conditioning_scale=controlnet_conditioning_scale,
)
return result.images[0]
def extract_canny_edges(
image: Image.Image,
low_threshold: int = 100,
high_threshold: int = 200,
) -> Image.Image:
"""Extract Canny edges from an image for ControlNet."""
import cv2
img_array = np.array(image)
edges = cv2.Canny(img_array, low_threshold, high_threshold)
return Image.fromarray(edges)Step 6: LoRA Style Adapters
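Why are LoRA files so small? The adapter stores a low-rank factorization B·A of the weight update instead of a full matrix. A quick parameter count, using a 768-wide layer (the text-conditioning width in SD 1.5) and rank 8 as illustrative values:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    # Full update: one d_out x d_in matrix.
    # LoRA update: A (rank x d_in) plus B (d_out x rank).
    full = d_out * d_in
    lora = rank * d_in + d_out * rank
    return full, lora

full, lora = lora_params(768, 768, 8)
print(full, lora, full // lora)  # 589824 12288 48
```

Multiply that per-layer saving across every adapted attention layer and a full-model delta shrinks to a few megabytes.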
"""Apply LoRA style adapters to Stable Diffusion."""
import torch
from diffusers import StableDiffusionPipeline
def generate_with_lora(
prompt: str,
lora_model_id: str,
base_model: str = "runwayml/stable-diffusion-v1-5",
lora_scale: float = 1.0,
num_inference_steps: int = 30,
):
"""
Generate an image with a LoRA style adapter.
LoRA adapters are small files (typically 2-50MB) that modify
the model's style without changing the full weights (2-7GB).
"""
pipe = StableDiffusionPipeline.from_pretrained(
base_model,
torch_dtype=torch.float16,
).to("cuda")
# Load LoRA weights from Hub
pipe.load_lora_weights(lora_model_id)
result = pipe(
prompt=prompt,
num_inference_steps=num_inference_steps,
cross_attention_kwargs={"scale": lora_scale},
)
# Unload LoRA to free memory
pipe.unload_lora_weights()
return result.images[0]
def merge_multiple_loras(
prompt: str,
lora_ids: list[str],
lora_scales: list[float],
base_model: str = "runwayml/stable-diffusion-v1-5",
):
"""Combine multiple LoRA adapters for blended styles."""
pipe = StableDiffusionPipeline.from_pretrained(
base_model,
torch_dtype=torch.float16,
).to("cuda")
for lora_id, scale in zip(lora_ids, lora_scales):
pipe.load_lora_weights(lora_id, adapter_name=lora_id.split("/")[-1])
# Set adapter weights
adapter_names = [lid.split("/")[-1] for lid in lora_ids]
pipe.set_adapters(adapter_names, adapter_weights=lora_scales)
result = pipe(prompt=prompt, num_inference_steps=30)
return result.images[0]Step 7: Memory Optimization
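Before the techniques below, the dtype arithmetic that motivates torch_dtype=torch.float16 — a back-of-the-envelope sketch, taking the often-quoted ~860M parameter count for the SD 1.5 U-Net as approximate:

```python
def weights_gib(num_params: float, bytes_per_param: int) -> float:
    # Raw weight storage only; activations and caches come on top.
    return num_params * bytes_per_param / 1024**3

UNET_PARAMS = 860e6  # SD 1.5 U-Net, approximate

print(round(weights_gib(UNET_PARAMS, 4), 2))  # fp32: 3.2 GiB
print(round(weights_gib(UNET_PARAMS, 2), 2))  # fp16: 1.6 GiB
```

Halving every weight is the cheapest optimization in this list, which is why fp16 loading appears in every snippet above.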
"""Memory and speed optimization for diffusion pipelines."""
import torch
from diffusers import StableDiffusionPipeline
def setup_optimized_pipeline(
model_id: str = "runwayml/stable-diffusion-v1-5",
) -> StableDiffusionPipeline:
"""Create a memory-optimized pipeline."""
pipe = StableDiffusionPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
)
# Option 1: Standard GPU (needs ~4GB VRAM)
pipe = pipe.to("cuda")
# Option 2: Sequential CPU offload (needs ~2GB VRAM)
# pipe.enable_sequential_cpu_offload()
# Option 3: Model CPU offload (needs ~3GB VRAM, faster than sequential)
# pipe.enable_model_cpu_offload()
# Enable memory-efficient attention
# (requires xformers or PyTorch 2.0+)
pipe.enable_attention_slicing()
# Enable VAE slicing for large images
pipe.enable_vae_slicing()
return pipeMemory Optimization Options:
| Technique | VRAM Usage | Speed | Setup |
|---|---|---|---|
| Full GPU | ~7 GB | Fastest | .to("cuda") |
| enable_attention_slicing | ~5 GB | Slightly slower | One line |
| enable_model_cpu_offload | ~3 GB | Slower | One line |
| enable_sequential_cpu_offload | ~2 GB | Slowest | One line |
| torch_dtype=float16 | Half of fp32 | Same | On load |
Running the Project
# Install dependencies
pip install -r requirements.txt
# Generate an image
python -c "
from src.text_to_image import generate_image
img = generate_image('A serene lake at sunset, photorealistic', seed=42)
img.save('output.png')
"
# Compare schedulers
python -c "
from src.schedulers import compare_schedulers
results = compare_schedulers('A fantasy castle in the clouds')
for name, img in results.items():
img.save(f'scheduler_{name}.png')
"Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Latent Diffusion | Denoising in compressed latent space | 48x more efficient than pixel-space diffusion |
| Guidance Scale | Amplifies prompt conditioning | Controls creativity vs prompt adherence |
| Scheduler | Controls the denoising step schedule | DPM++ needs 15 steps vs DDPM's 1000 |
| ControlNet | Additional conditioning on structure | Generate images following edges, poses, depth |
| LoRA Adapter | Small weight deltas for style transfer | Change style with 2-50MB instead of 2-7GB |
| Negative Prompt | What to avoid in generation | Dramatically improves output quality |
| CPU Offload | Move unused model parts to CPU | Run SDXL on 4GB VRAM GPUs |
Next Steps
- Fine-Tuning with PEFT — Train your own LoRA adapters
- Production AI Workbench — Build a Gradio UI for image generation