HuggingFace Ecosystem · Intermediate
Image Generation with Diffusers
Generate, edit, and control images with Stable Diffusion and the diffusers library
TL;DR
Use the HuggingFace diffusers library to run Stable Diffusion for text-to-image generation, image-to-image editing, inpainting, and controlled generation with ControlNet. Learn how noise schedulers work and how to apply LoRA style adapters.
Build a complete image generation application using the diffusers library, covering text-to-image, img2img, inpainting, ControlNet, and LoRA style adapters.
What You'll Learn
- Stable Diffusion architecture and diffusion process
- Text-to-image generation with prompt engineering
- Image-to-image editing and inpainting
- Noise schedulers (DDPM, DDIM, Euler, DPM++)
- ControlNet for structure-guided generation
- LoRA style adapters for customization
- Memory optimization with accelerate
Tech Stack
| Component | Technology |
|---|---|
| Diffusion | diffusers |
| Models | transformers |
| Optimization | accelerate |
| Image Processing | Pillow |
| Python | 3.10+ |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ STABLE DIFFUSION ARCHITECTURE │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ TEXT-TO-IMAGE PIPELINE │
│ │
│ "A cat on the moon" ──► CLIP Text Encoder ──► Text Embeddings │
│ │ │
│ ▼ │
│ Random Noise ──► ┌────────────────────────────────────────┐ │
│ (latent space) │ U-Net Denoiser │ │
│ │ │ │
│ │ Step 1: Predict noise at timestep T │ │
│ │ Step 2: Remove predicted noise │ │
│ │ Step 3: Repeat for T-1, T-2, ..., 0 │ │
│ │ │ │
│ │ (conditioned on text embeddings) │ │
│ └───────────────────────────┬────────────┘ │
│ │ │
│ ▼ │
│ VAE Decoder ──► Final Image (512×512) │
│ (latent → pixel) │
│ │
│ Why latent space? 512×512×3 = 786K values │
│ 64×64×4 = 16K values (48x smaller!) │
│ │
└──────────────────────────────────────────────────────────────────────────────┘

Project Structure
image-generation/
├── src/
│ ├── __init__.py
│ ├── text_to_image.py # Basic text-to-image generation
│ ├── img2img.py # Image-to-image editing
│ ├── inpainting.py # Selective region editing
│ ├── controlnet.py # Structure-guided generation
│ ├── schedulers.py # Noise scheduler comparison
│ ├── lora_styles.py # LoRA style adapters
│ └── optimization.py # Memory and speed optimization
├── api/
│ └── main.py # FastAPI application
├── requirements.txt
└── README.md

Implementation
Step 1: Dependencies
diffusers>=0.28.0
transformers>=4.40.0
accelerate>=0.30.0
torch>=2.0.0
Pillow>=10.0.0
fastapi>=0.111.0
uvicorn>=0.30.0
safetensors>=0.4.0

Step 2: Text-to-Image Generation
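Before the pipeline code, a quick check of the "48x smaller" claim from the architecture diagram — plain arithmetic, no GPU required:

```python
# SD 1.x denoises a 4x64x64 latent instead of a 3x512x512 RGB image.
pixel_values = 3 * 512 * 512   # values in pixel space
latent_values = 4 * 64 * 64    # values in latent space
ratio = pixel_values / latent_values

print(pixel_values, latent_values, ratio)  # 786432 16384 48.0
```

Every denoising step runs the U-Net over 48x fewer values, which is why latent diffusion is practical on consumer GPUs.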
"""Text-to-image generation with Stable Diffusion."""
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionXLPipeline
from PIL import Image
def generate_image(
prompt: str,
negative_prompt: str = "low quality, blurry, distorted",
model_id: str = "stabilityai/stable-diffusion-xl-base-1.0",
num_inference_steps: int = 30,
guidance_scale: float = 7.5,
width: int = 1024,
height: int = 1024,
seed: int | None = None,
) -> Image.Image:
"""
Generate an image from a text prompt.
Args:
prompt: What to generate
negative_prompt: What to avoid (improves quality significantly)
model_id: HuggingFace model ID for Stable Diffusion
num_inference_steps: Denoising steps (more = better quality, slower)
guidance_scale: How closely to follow the prompt (7-9 is typical)
width: Output width (must be multiple of 8)
height: Output height (must be multiple of 8)
seed: Random seed for reproducibility
"""
# Select pipeline class based on model
if "xl" in model_id.lower():
pipe = StableDiffusionXLPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
variant="fp16",
use_safetensors=True,
)
else:
pipe = StableDiffusionPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
use_safetensors=True,
)
pipe = pipe.to("cuda")
# Enable memory-efficient attention
pipe.enable_xformers_memory_efficient_attention()
# Set seed for reproducibility
generator = None
if seed is not None:
generator = torch.Generator(device="cuda").manual_seed(seed)
# Generate
result = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
width=width,
height=height,
generator=generator,
)
return result.images[0]
def generate_batch(
prompts: list[str],
negative_prompt: str = "low quality, blurry",
model_id: str = "runwayml/stable-diffusion-v1-5",
num_inference_steps: int = 30,
) -> list[Image.Image]:
"""Generate multiple images at once (uses more VRAM)."""
pipe = StableDiffusionPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
).to("cuda")
result = pipe(
prompt=prompts,
negative_prompt=[negative_prompt] * len(prompts),
num_inference_steps=num_inference_steps,
)
return result.imagesGuidance Scale Explained:
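The box below explains classifier-free guidance; its one-line update rule can be sanity-checked with scalar stand-ins for the noise tensors:

```python
def cfg(uncond: float, cond: float, scale: float) -> float:
    # predicted_noise = unconditional + scale * (conditional - unconditional)
    return uncond + scale * (cond - uncond)

# scale = 1.0 returns the plain conditional prediction (no guidance):
print(cfg(0.0, 1.0, 1.0))  # 1.0
# Higher scales push the prediction further along the text-conditioned direction:
print(cfg(0.0, 1.0, 7.5))  # 7.5
```

The real pipeline applies the same formula element-wise over the full latent noise tensor at every denoising step.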
┌─────────────────────────────────────────────────────────────────┐
│ CLASSIFIER-FREE GUIDANCE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ guidance_scale controls prompt adherence vs creativity: │
│ │
│   Scale = 1.0:  No guidance; plain conditional prediction       │
│ Scale = 3.0: Loosely follow prompt (creative, diverse) │
│ Scale = 7.5: Balanced (default, good quality) │
│ Scale = 12.0: Strict prompt following (may oversaturate) │
│ Scale = 20.0: Extreme adherence (artifacts, unrealistic) │
│ │
│ How it works: │
│ predicted_noise = unconditional + scale × (conditional - unconditional)
│ │
│ The U-Net runs TWICE per step: │
│ 1. Without text conditioning (unconditional) │
│ 2. With text conditioning (conditional) │
│ Then amplifies the difference by guidance_scale. │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 3: Noise Schedulers
"""Compare noise schedulers for diffusion models."""
from diffusers import (
DDPMScheduler,
DDIMScheduler,
EulerDiscreteScheduler,
EulerAncestralDiscreteScheduler,
DPMSolverMultistepScheduler,
StableDiffusionPipeline,
)
import torch
SCHEDULERS = {
"ddpm": DDPMScheduler,
"ddim": DDIMScheduler,
"euler": EulerDiscreteScheduler,
"euler_a": EulerAncestralDiscreteScheduler,
"dpm++": DPMSolverMultistepScheduler,
}
def compare_schedulers(
prompt: str,
model_id: str = "runwayml/stable-diffusion-v1-5",
num_steps: int = 30,
seed: int = 42,
) -> dict:
"""
Generate the same prompt with different schedulers.
Returns dict mapping scheduler name to generated image.
"""
pipe = StableDiffusionPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
).to("cuda")
results = {}
for name, scheduler_class in SCHEDULERS.items():
pipe.scheduler = scheduler_class.from_config(
pipe.scheduler.config
)
generator = torch.Generator(device="cuda").manual_seed(seed)
image = pipe(
prompt=prompt,
num_inference_steps=num_steps,
generator=generator,
).images[0]
results[name] = image
print(f" {name}: done")
return resultsScheduler Comparison:
| Scheduler | Steps Needed | Speed | Quality | Deterministic |
|---|---|---|---|---|
| DDPM | 1000 | Slowest | Baseline | No (stochastic) |
| DDIM | 20-50 | Fast | Good | Yes |
| Euler | 20-30 | Fast | Good | Yes |
| Euler Ancestral | 20-30 | Fast | Creative | No (stochastic) |
| DPM++ 2M | 15-25 | Fastest | Best | Yes |
Step 4: Image-to-Image and Inpainting
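A detail worth internalizing before the code: strength does not blend pixels. It sets how far into the noise schedule the input image is pushed, which in turn decides how many denoising steps actually run. A pure-Python sketch of the convention the diffusers img2img pipeline uses (the real pipeline clamps the same way):

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    # The input image is noised up to step int(steps * strength);
    # only those final denoising steps are actually executed.
    return min(int(num_inference_steps * strength), num_inference_steps)

print(effective_steps(30, 0.5))   # 15 -> subtle edit, input clearly preserved
print(effective_steps(30, 0.75))  # 22 -> major change (the default)
print(effective_steps(30, 1.0))   # 30 -> full redraw from pure noise
```

This is why low strength values are both faster and more faithful to the input: they simply skip most of the denoising trajectory.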
"""Image-to-image editing with Stable Diffusion."""
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image
def edit_image(
image: Image.Image,
prompt: str,
strength: float = 0.75,
model_id: str = "runwayml/stable-diffusion-v1-5",
num_inference_steps: int = 30,
guidance_scale: float = 7.5,
) -> Image.Image:
"""
Edit an image guided by a text prompt.
Args:
image: Source image to edit
prompt: Text description of desired result
strength: How much to change (0.0 = no change, 1.0 = complete redraw)
Typical range: 0.3 (subtle edit) to 0.8 (major change)
model_id: Model to use
num_inference_steps: Denoising steps
guidance_scale: Prompt adherence strength
"""
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
).to("cuda")
# Resize image to model's expected dimensions
image = image.resize((512, 512))
result = pipe(
prompt=prompt,
image=image,
strength=strength,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
)
return result.images[0]"""Inpainting: edit specific regions of an image."""
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image
def inpaint(
image: Image.Image,
mask: Image.Image,
prompt: str,
model_id: str = "runwayml/stable-diffusion-inpainting",
num_inference_steps: int = 30,
) -> Image.Image:
"""
Edit only the masked region of an image.
Args:
image: Original image
mask: Binary mask (white = edit, black = keep)
prompt: What to fill in the masked region
model_id: Inpainting model
num_inference_steps: Denoising steps
"""
pipe = StableDiffusionInpaintPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
).to("cuda")
image = image.resize((512, 512))
mask = mask.resize((512, 512))
result = pipe(
prompt=prompt,
image=image,
mask_image=mask,
num_inference_steps=num_inference_steps,
)
return result.images[0]Diffusion Pipeline Variants:
┌─────────────────────────────────────────────────────────────────┐
│ PIPELINE TYPES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ TEXT-TO-IMAGE │
│ "A sunset over mountains" ──► [noise] ──► [denoise] ──► image │
│ Start from pure noise │
│ │
│ IMG2IMG │
│ image + "oil painting style" ──► [add noise] ──► [denoise] ──► │
│ Start from noised version of input (strength controls noise) │
│ │
│ INPAINTING │
│ image + mask + "a red car" ──► [denoise masked region] ──► │
│ Only edit the white-masked area, keep the rest intact │
│ │
│ CONTROLNET │
│ image + control_image + "detailed photo" ──► [denoise] ──► │
│ Follow the structure of the control image (edges, pose, depth) │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 5: ControlNet
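ControlNet trains a copy of the U-Net encoder on the control image and adds its outputs into the frozen U-Net's skip connections; controlnet_conditioning_scale simply scales those residuals before the add. A toy sketch of that injection (plain Python with made-up numbers, not real tensors):

```python
def inject_control(unet_hidden: list[float],
                   control_residual: list[float],
                   conditioning_scale: float) -> list[float]:
    # Frozen U-Net activations plus scaled ControlNet residuals.
    return [h + conditioning_scale * r
            for h, r in zip(unet_hidden, control_residual)]

hidden = [1.0, -2.0, 0.5]
residual = [0.5, 1.0, -0.5]
print(inject_control(hidden, residual, 0.0))  # [1.0, -2.0, 0.5] -> control ignored
print(inject_control(hidden, residual, 1.0))  # [1.5, -1.0, 0.0] -> full strength
```

Because the base model's weights are untouched, the same ControlNet can be paired with any compatible SD 1.5 checkpoint.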
"""ControlNet for structure-guided image generation."""
import torch
from diffusers import (
StableDiffusionControlNetPipeline,
ControlNetModel,
UniPCMultistepScheduler,
)
from PIL import Image
import numpy as np
def generate_with_controlnet(
prompt: str,
control_image: Image.Image,
controlnet_model: str = "lllyasviel/sd-controlnet-canny",
sd_model: str = "runwayml/stable-diffusion-v1-5",
num_inference_steps: int = 30,
controlnet_conditioning_scale: float = 1.0,
) -> Image.Image:
"""
Generate an image following the structure of a control image.
Args:
prompt: Text description
control_image: Preprocessed control image (edges, pose, depth)
controlnet_model: ControlNet model for the control type
sd_model: Base Stable Diffusion model
num_inference_steps: Denoising steps
controlnet_conditioning_scale: How strongly to follow the control
(0.0 = ignore, 1.0 = strict)
"""
controlnet = ControlNetModel.from_pretrained(
controlnet_model,
torch_dtype=torch.float16,
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
sd_model,
controlnet=controlnet,
torch_dtype=torch.float16,
).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(
pipe.scheduler.config
)
result = pipe(
prompt=prompt,
image=control_image,
num_inference_steps=num_inference_steps,
controlnet_conditioning_scale=controlnet_conditioning_scale,
)
return result.images[0]
def extract_canny_edges(
image: Image.Image,
low_threshold: int = 100,
high_threshold: int = 200,
) -> Image.Image:
"""Extract Canny edges from an image for ControlNet."""
import cv2
img_array = np.array(image)
edges = cv2.Canny(img_array, low_threshold, high_threshold)
return Image.fromarray(edges)Step 6: LoRA Style Adapters
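Why are LoRA files so small? The adapter stores a low-rank factorization B·A of the weight update instead of a full matrix. A quick parameter count, using a 768-wide layer (the text-conditioning width in SD 1.5) and rank 8 as illustrative values:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    # Full update: one d_out x d_in matrix.
    # LoRA update: A (rank x d_in) plus B (d_out x rank).
    full = d_out * d_in
    lora = rank * d_in + d_out * rank
    return full, lora

full, lora = lora_params(768, 768, 8)
print(full, lora, full // lora)  # 589824 12288 48
```

Multiply that per-layer saving across every adapted attention layer and a full-model delta shrinks to a few megabytes.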
"""Apply LoRA style adapters to Stable Diffusion."""
import torch
from diffusers import StableDiffusionPipeline
def generate_with_lora(
prompt: str,
lora_model_id: str,
base_model: str = "runwayml/stable-diffusion-v1-5",
lora_scale: float = 1.0,
num_inference_steps: int = 30,
):
"""
Generate an image with a LoRA style adapter.
LoRA adapters are small files (typically 2-50MB) that modify
the model's style without changing the full weights (2-7GB).
"""
pipe = StableDiffusionPipeline.from_pretrained(
base_model,
torch_dtype=torch.float16,
).to("cuda")
# Load LoRA weights from Hub
pipe.load_lora_weights(lora_model_id)
result = pipe(
prompt=prompt,
num_inference_steps=num_inference_steps,
cross_attention_kwargs={"scale": lora_scale},
)
# Unload LoRA to free memory
pipe.unload_lora_weights()
return result.images[0]
def merge_multiple_loras(
prompt: str,
lora_ids: list[str],
lora_scales: list[float],
base_model: str = "runwayml/stable-diffusion-v1-5",
):
"""Combine multiple LoRA adapters for blended styles."""
pipe = StableDiffusionPipeline.from_pretrained(
base_model,
torch_dtype=torch.float16,
).to("cuda")
for lora_id, scale in zip(lora_ids, lora_scales):
pipe.load_lora_weights(lora_id, adapter_name=lora_id.split("/")[-1])
# Set adapter weights
adapter_names = [lid.split("/")[-1] for lid in lora_ids]
pipe.set_adapters(adapter_names, adapter_weights=lora_scales)
result = pipe(prompt=prompt, num_inference_steps=30)
return result.images[0]Step 7: Memory Optimization
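Before the techniques below, the dtype arithmetic that motivates torch_dtype=torch.float16 — a back-of-the-envelope sketch, taking the often-quoted ~860M parameter count for the SD 1.5 U-Net as approximate:

```python
def weights_gib(num_params: float, bytes_per_param: int) -> float:
    # Raw weight storage only; activations and caches come on top.
    return num_params * bytes_per_param / 1024**3

UNET_PARAMS = 860e6  # SD 1.5 U-Net, approximate

print(round(weights_gib(UNET_PARAMS, 4), 2))  # fp32: 3.2 GiB
print(round(weights_gib(UNET_PARAMS, 2), 2))  # fp16: 1.6 GiB
```

Halving every weight is the cheapest optimization in this list, which is why fp16 loading appears in every snippet above.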
"""Memory and speed optimization for diffusion pipelines."""
import torch
from diffusers import StableDiffusionPipeline
def setup_optimized_pipeline(
model_id: str = "runwayml/stable-diffusion-v1-5",
) -> StableDiffusionPipeline:
"""Create a memory-optimized pipeline."""
pipe = StableDiffusionPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
)
# Option 1: Standard GPU (needs ~4GB VRAM)
pipe = pipe.to("cuda")
# Option 2: Sequential CPU offload (needs ~2GB VRAM)
# pipe.enable_sequential_cpu_offload()
# Option 3: Model CPU offload (needs ~3GB VRAM, faster than sequential)
# pipe.enable_model_cpu_offload()
# Enable memory-efficient attention
# (requires xformers or PyTorch 2.0+)
pipe.enable_attention_slicing()
# Enable VAE slicing for large images
pipe.enable_vae_slicing()
return pipeMemory Optimization Options:
| Technique | VRAM Usage | Speed | Setup |
|---|---|---|---|
| Full GPU | ~7 GB | Fastest | .to("cuda") |
| enable_attention_slicing | ~5 GB | Slightly slower | One line |
| enable_model_cpu_offload | ~3 GB | Slower | One line |
| enable_sequential_cpu_offload | ~2 GB | Slowest | One line |
| torch_dtype=float16 | Half of fp32 | Same | On load |
Running the Project
# Install dependencies
pip install -r requirements.txt
# Generate an image
python -c "
from src.text_to_image import generate_image
img = generate_image('A serene lake at sunset, photorealistic', seed=42)
img.save('output.png')
"
# Compare schedulers
python -c "
from src.schedulers import compare_schedulers
results = compare_schedulers('A fantasy castle in the clouds')
for name, img in results.items():
img.save(f'scheduler_{name}.png')
"Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Latent Diffusion | Denoising in compressed latent space | 48x more efficient than pixel-space diffusion |
| Guidance Scale | Amplifies prompt conditioning | Controls creativity vs prompt adherence |
| Scheduler | Controls the denoising step schedule | DPM++ needs 15 steps vs DDPM's 1000 |
| ControlNet | Additional conditioning on structure | Generate images following edges, poses, depth |
| LoRA Adapter | Small weight deltas for style transfer | Change style with 2-50MB instead of 2-7GB |
| Negative Prompt | What to avoid in generation | Dramatically improves output quality |
| CPU Offload | Move unused model parts to CPU | Run SDXL on 4GB VRAM GPUs |
Next Steps
- Fine-Tuning with PEFT — Train your own LoRA adapters
- Production AI Workbench — Build a Gradio UI for image generation