Image Generation Model

Personal Project

PyTorch · Python

Overview

A from-scratch implementation of a Stable Diffusion image generation inference pipeline in PyTorch. The project explored the core components of latent diffusion — the VAE encoder/decoder, the U-Net denoising network, and the CLIP text conditioning pipeline — as a deep learning study in generative modelling.


Role

Sole Researcher & Engineer


Problem

Generative image models are often treated as black boxes behind an API. The goal was to understand the internals of Stable Diffusion by implementing the key inference components from the ground up rather than wrapping an existing pipeline, building intuition for latent spaces, diffusion schedules, and conditional generation.

Solution

Implemented the core Stable Diffusion inference pipeline in PyTorch: a Variational Autoencoder to project images into a compressed latent space and back, a U-Net with attention blocks for iterative denoising, and a DDPM noise scheduler. Text conditioning was applied via CLIP embeddings fed through cross-attention layers in the U-Net.
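The overall control flow can be sketched as follows. This is a minimal illustration, not the project's code: the U-Net and VAE decoder are replaced by stand-in callables so the loop structure is visible, and the update rule shown is the simpler deterministic (DDIM-style) variant rather than the stochastic DDPM step the project uses.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for the real networks (stubs, for shape/flow only)
unet = lambda x, t, ctx: 0.1 * torch.randn_like(x)  # stub noise predictor
vae_decode = lambda z: z                            # stub VAE decoder
context = torch.randn(1, 77, 768)                   # CLIP text embeddings (B, tokens, dim)

# Linear beta schedule, as in DDPM / SD v1-5
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)
timesteps = torch.linspace(999, 0, 50).long()       # 50 inference steps

latents = torch.randn(1, 4, 64, 64)                 # start from pure noise
for i, t in enumerate(timesteps):
    eps = unet(latents, t, context)                 # predict the noise in x_t
    a_bar = alphas_cumprod[t]
    # Estimate the clean latent: x_t = sqrt(a_bar) x_0 + sqrt(1 - a_bar) eps
    x0 = (latents - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
    if i + 1 < len(timesteps):
        a_prev = alphas_cumprod[timesteps[i + 1]]
        latents = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    else:
        latents = x0

image = vae_decode(latents)
```

Swapping in the real U-Net and VAE, plus the stochastic DDPM update, turns this skeleton into the full text-to-image pipeline described above.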


Architecture

A three-component pipeline: VAE for latent compression/decompression, a U-Net denoiser with cross-attention for text conditioning, and a noise scheduler controlling the reverse diffusion process.
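The latent compression step can be illustrated with a toy stand-in for the VAE: three stride-2 convolutions give the 8x spatial downsampling into a 4-channel latent. Channel widths here are illustrative only, not those of the real SD v1-5 autoencoder.

```python
import torch
import torch.nn as nn

# Toy encoder: 512x512 RGB -> 4-channel latent at 1/8 resolution
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1),   # 512 -> 256
    nn.Conv2d(32, 64, 3, stride=2, padding=1),  # 256 -> 128
    nn.Conv2d(64, 4, 3, stride=2, padding=1),   # 128 -> 64
)
# Toy decoder: latent back to pixel space
decoder = nn.Sequential(
    nn.Upsample(scale_factor=8, mode="nearest"),
    nn.Conv2d(4, 3, 3, padding=1),              # back to RGB at 512x512
)

image = torch.randn(1, 3, 512, 512)
latent = encoder(image)   # (1, 4, 64, 64)
recon = decoder(latent)   # (1, 3, 512, 512)
```

Running diffusion on the 4x64x64 latent instead of the 3x512x512 image is what makes the denoising loop tractable: the U-Net operates on 64x fewer spatial positions.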

Key Design Decisions

  • Variational Autoencoder (VAE) for encoding images into a 4-channel latent space and decoding generated latents back to pixel space
  • U-Net architecture with ResNet blocks and multi-head cross-attention for conditioning on CLIP text embeddings
  • DDPM noise scheduler implementing the forward diffusion process (for image-to-image) and reverse denoising
  • Text prompt conditioning via CLIP ViT embeddings injected through cross-attention at multiple U-Net resolutions
  • Custom script to map and load standard pre-trained Stable Diffusion weights (v1-5) into the custom PyTorch architecture
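The cross-attention conditioning in the second and fourth bullets can be sketched as below: spatial U-Net features act as queries, CLIP token embeddings as keys and values. The token count (77) and text width (768) follow SD v1-5 conventions; the U-Net channel width and the use of `nn.MultiheadAttention` are simplifications of this sketch, not the project's exact layers.

```python
import torch
import torch.nn as nn

d_model, n_heads = 320, 8
# kdim/vdim let keys/values come from the 768-dim CLIP space directly
attn = nn.MultiheadAttention(d_model, n_heads, kdim=768, vdim=768, batch_first=True)

feats = torch.randn(1, 320, 64, 64)      # U-Net feature map (B, C, H, W)
context = torch.randn(1, 77, 768)        # CLIP text embeddings (B, tokens, dim)

q = feats.flatten(2).transpose(1, 2)     # (B, H*W, C): one query per spatial position
out, _ = attn(q, context, context)       # each position attends over the prompt tokens
out = out.transpose(1, 2).reshape_as(feats)  # back to (B, C, H, W)
```

Because every spatial position attends over all prompt tokens, the same mechanism works unchanged at each U-Net resolution; only `d_model` varies per level.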

Challenges

  • Aligning tensor dimensions across the U-Net's multi-resolution skip connections and cross-attention heads
  • Mapping pre-trained Stable Diffusion weights accurately onto the custom-implemented PyTorch classes and layers
  • Managing memory efficiently during inference by implementing device offloading strategies across CPU, CUDA, and MPS
  • Accurately reproducing the DDPM reverse process steps and variance calculations to denoise latents correctly
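The last challenge, getting the reverse-step mean and variance right, comes down to the posterior q(x_{t-1} | x_t, x_0) from Ho et al. (2020). A sketch of one stochastic step, with the function name and the x_0 clamp being choices of this sketch rather than details from the project:

```python
import torch

def ddpm_step(x_t, eps, t, alphas_cumprod, betas, generator=None):
    """One reverse DDPM step from timestep t (an int) to t-1."""
    a_bar_t = alphas_cumprod[t]
    a_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
    alpha_t = 1 - betas[t]

    # Predicted clean latent, from x_t = sqrt(a_bar) x_0 + sqrt(1 - a_bar) eps
    x0 = (x_t - (1 - a_bar_t).sqrt() * eps) / a_bar_t.sqrt()
    x0 = x0.clamp(-1, 1)  # common stabilisation trick

    # Posterior mean (DDPM Eq. 7) and "beta-tilde" variance
    mean = (a_bar_prev.sqrt() * betas[t] / (1 - a_bar_t)) * x0 \
         + (alpha_t.sqrt() * (1 - a_bar_prev) / (1 - a_bar_t)) * x_t
    var = (1 - a_bar_prev) / (1 - a_bar_t) * betas[t]

    if t == 0:
        return mean  # final step is deterministic: no noise added
    noise = torch.randn(x_t.shape, generator=generator)
    return mean + var.sqrt() * noise
```

The easy-to-miss details are exactly the ones listed above: the variance uses beta-tilde rather than beta_t, and no noise is added on the final step.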

Impact

  • Built a working end-to-end text-to-image and image-to-image generation inference pipeline from mathematical foundations
  • Developed deep intuition for latent diffusion, attention mechanisms, and noise scheduling
  • Published as an open-source reference implementation on GitHub