Image Generation Model

Personal Project

PyTorch · Python

Overview

A from-scratch implementation of a Stable Diffusion image generation inference pipeline in PyTorch. The project explored the core components of latent diffusion — the VAE encoder/decoder, the U-Net denoising network, and the CLIP text conditioning pipeline — as a deep learning study in generative modelling.


Role

Sole Researcher & Engineer


Problem

Generative image models are often treated as black boxes behind an API. The goal was to understand the internals of Stable Diffusion by implementing the key inference components from the ground up rather than wrapping an existing pipeline, building intuition for latent spaces, diffusion schedules, and conditional generation.

Solution

Implemented the core Stable Diffusion inference pipeline in PyTorch: a Variational Autoencoder to project images into a compressed latent space and back, a U-Net with attention blocks for iterative denoising, and a DDPM noise scheduler. Text conditioning was applied via CLIP embeddings fed through cross-attention layers in the U-Net.
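The overall control flow can be sketched as follows. This is a minimal illustration, not the project's code: the U-Net and VAE decoder are replaced by stand-in callables so the loop structure is visible, and the update rule shown is the simpler deterministic (DDIM-style) variant rather than the stochastic DDPM step the project uses.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for the real networks (stubs, for shape/flow only)
unet = lambda x, t, ctx: 0.1 * torch.randn_like(x)  # stub noise predictor
vae_decode = lambda z: z                            # stub VAE decoder
context = torch.randn(1, 77, 768)                   # CLIP text embeddings (B, tokens, dim)

# Linear beta schedule, as in DDPM / SD v1-5
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)
timesteps = torch.linspace(999, 0, 50).long()       # 50 inference steps

latents = torch.randn(1, 4, 64, 64)                 # start from pure noise
for i, t in enumerate(timesteps):
    eps = unet(latents, t, context)                 # predict the noise in x_t
    a_bar = alphas_cumprod[t]
    # Estimate the clean latent: x_t = sqrt(a_bar) x_0 + sqrt(1 - a_bar) eps
    x0 = (latents - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
    if i + 1 < len(timesteps):
        a_prev = alphas_cumprod[timesteps[i + 1]]
        latents = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    else:
        latents = x0

image = vae_decode(latents)
```

Swapping in the real U-Net and VAE, plus the stochastic DDPM update, turns this skeleton into the full text-to-image pipeline described above.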


Architecture

A three-component pipeline: VAE for latent compression/decompression, a U-Net denoiser with cross-attention for text conditioning, and a noise scheduler controlling the reverse diffusion process.
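The latent compression step can be illustrated with a toy stand-in for the VAE: three stride-2 convolutions give the 8x spatial downsampling into a 4-channel latent. Channel widths here are illustrative only, not those of the real SD v1-5 autoencoder.

```python
import torch
import torch.nn as nn

# Toy encoder: 512x512 RGB -> 4-channel latent at 1/8 resolution
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1),   # 512 -> 256
    nn.Conv2d(32, 64, 3, stride=2, padding=1),  # 256 -> 128
    nn.Conv2d(64, 4, 3, stride=2, padding=1),   # 128 -> 64
)
# Toy decoder: latent back to pixel space
decoder = nn.Sequential(
    nn.Upsample(scale_factor=8, mode="nearest"),
    nn.Conv2d(4, 3, 3, padding=1),              # back to RGB at 512x512
)

image = torch.randn(1, 3, 512, 512)
latent = encoder(image)   # (1, 4, 64, 64)
recon = decoder(latent)   # (1, 3, 512, 512)
```

Running diffusion on the 4x64x64 latent instead of the 3x512x512 image is what makes the denoising loop tractable: the U-Net operates on 64x fewer spatial positions.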

Key Design Decisions

  • Variational Autoencoder (VAE) for encoding images into a 4-channel latent space and decoding generated latents back to pixel space
  • U-Net architecture with ResNet blocks and multi-head cross-attention for conditioning on CLIP text embeddings
  • DDPM noise scheduler implementing the forward diffusion process (for image-to-image) and reverse denoising
  • Text prompt conditioning via CLIP ViT embeddings injected through cross-attention at multiple U-Net resolutions
  • Custom script to map and load standard pre-trained Stable Diffusion weights (v1-5) into the custom PyTorch architecture
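The cross-attention conditioning in the second and fourth bullets can be sketched as below: spatial U-Net features act as queries, CLIP token embeddings as keys and values. The token count (77) and text width (768) follow SD v1-5 conventions; the U-Net channel width and the use of `nn.MultiheadAttention` are simplifications of this sketch, not the project's exact layers.

```python
import torch
import torch.nn as nn

d_model, n_heads = 320, 8
# kdim/vdim let keys/values come from the 768-dim CLIP space directly
attn = nn.MultiheadAttention(d_model, n_heads, kdim=768, vdim=768, batch_first=True)

feats = torch.randn(1, 320, 64, 64)      # U-Net feature map (B, C, H, W)
context = torch.randn(1, 77, 768)        # CLIP text embeddings (B, tokens, dim)

q = feats.flatten(2).transpose(1, 2)     # (B, H*W, C): one query per spatial position
out, _ = attn(q, context, context)       # each position attends over the prompt tokens
out = out.transpose(1, 2).reshape_as(feats)  # back to (B, C, H, W)
```

Because every spatial position attends over all prompt tokens, the same mechanism works unchanged at each U-Net resolution; only `d_model` varies per level.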

Challenges

  • Aligning tensor dimensions across the U-Net's multi-resolution skip connections and cross-attention heads
  • Mapping pre-trained Stable Diffusion weights accurately onto the custom-implemented PyTorch classes and layers
  • Managing memory efficiently during inference by implementing device offloading strategies across CPU, CUDA, and MPS
  • Accurately reproducing the DDPM reverse process steps and variance calculations to denoise latents correctly
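The last challenge, getting the reverse-step mean and variance right, comes down to the posterior q(x_{t-1} | x_t, x_0) from Ho et al. (2020). A sketch of one stochastic step, with the function name and the x_0 clamp being choices of this sketch rather than details from the project:

```python
import torch

def ddpm_step(x_t, eps, t, alphas_cumprod, betas, generator=None):
    """One reverse DDPM step from timestep t (an int) to t-1."""
    a_bar_t = alphas_cumprod[t]
    a_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
    alpha_t = 1 - betas[t]

    # Predicted clean latent, from x_t = sqrt(a_bar) x_0 + sqrt(1 - a_bar) eps
    x0 = (x_t - (1 - a_bar_t).sqrt() * eps) / a_bar_t.sqrt()
    x0 = x0.clamp(-1, 1)  # common stabilisation trick

    # Posterior mean (DDPM Eq. 7) and "beta-tilde" variance
    mean = (a_bar_prev.sqrt() * betas[t] / (1 - a_bar_t)) * x0 \
         + (alpha_t.sqrt() * (1 - a_bar_prev) / (1 - a_bar_t)) * x_t
    var = (1 - a_bar_prev) / (1 - a_bar_t) * betas[t]

    if t == 0:
        return mean  # final step is deterministic: no noise added
    noise = torch.randn(x_t.shape, generator=generator)
    return mean + var.sqrt() * noise
```

The easy-to-miss details are exactly the ones listed above: the variance uses beta-tilde rather than beta_t, and no noise is added on the final step.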

Impact

  • Built a working end-to-end text-to-image and image-to-image generation inference pipeline from mathematical foundations
  • Developed deep intuition for latent diffusion, attention mechanisms, and noise scheduling
  • Published as an open-source reference implementation on GitHub