Stable Diffusion v1-1

CompVis

Introduction

Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images from text input. It was developed by Robin Rombach and Patrick Esser. The model is intended for research purposes and should not be used to generate harmful or illegal content.

Architecture

Stable Diffusion is a Latent Diffusion Model: it combines an autoencoder with a diffusion model that operates in the autoencoder's latent space. Text prompts are encoded by a fixed, pretrained CLIP ViT-L/14 text encoder. The architecture supports high-resolution image synthesis, and the model was trained on subsets of LAION-5B.
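
To make "operating in the latent space" concrete, the following is a minimal sketch of the shape arithmetic involved. It assumes a downsampling factor of 8 and 4 latent channels, values taken from the original Stable Diffusion v1 model card rather than from the text above. The function name latent_shape is purely illustrative and is not part of any library.

```python
def latent_shape(height: int, width: int, factor: int = 8, channels: int = 4) -> tuple:
    """Return (channels, latent_height, latent_width) for an input image size.

    factor and channels are assumptions based on the v1 model card.
    """
    if height % factor or width % factor:
        raise ValueError("height and width must be divisible by the downsampling factor")
    return (channels, height // factor, width // factor)


# Compare the number of values in a 512x512 RGB image with its latent.
image_values = 3 * 512 * 512
c, h, w = latent_shape(512, 512)
latent_values = c * h * w

print(latent_shape(512, 512))        # (4, 64, 64)
print(image_values // latent_values)  # 48
```

Under these assumptions the diffusion model works on a tensor roughly 48 times smaller than the pixel image, which is the main reason latent diffusion is cheaper than diffusion in pixel space.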

Training

The model underwent multiple training phases:

  • Stable-Diffusion-v1-1: Trained for 237,000 steps at 256x256 resolution on laion2B-en, followed by 194,000 steps at 512x512 resolution on laion-high-resolution.
  • Stable-Diffusion-v1-2 to v1-4: Resumed from earlier checkpoints and trained for additional steps, aimed at improved aesthetics and classifier-free guidance sampling. Training used 32 x 8 x A100 GPUs, the AdamW optimizer, and a learning rate of 0.0001 after warmup.
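
As a rough illustration of the training setup above, the sketch below computes the total GPU count and a learning-rate schedule. Two things are assumptions: that "32 x 8" means 32 nodes with 8 GPUs each, and that the warmup is linear over 10,000 steps (the figure given in the original model card; the text above only says "after warmup"). All names are illustrative.

```python
NODES = 32
GPUS_PER_NODE = 8
TOTAL_GPUS = NODES * GPUS_PER_NODE  # 256, assuming 32 nodes of 8 GPUs

BASE_LR = 1e-4          # learning rate after warmup (from the text above)
WARMUP_STEPS = 10_000   # assumption: linear warmup length


def learning_rate(step: int) -> float:
    """Linear warmup from 0 to BASE_LR, then constant."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR


print(TOTAL_GPUS)
for step in (0, 5_000, 10_000, 237_000):
    print(step, learning_rate(step))
```

In an actual training script this logic would normally live in the framework's scheduler rather than a hand-written function; the sketch only shows the shape of the schedule.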

Guide: Running Locally

To run Stable Diffusion locally, follow these steps:

  1. Install Required Libraries:
    pip install --upgrade diffusers transformers scipy
    
  2. Load the Pipeline:
    import torch
    from diffusers import StableDiffusionPipeline
    # Download (or load from cache) the v1-1 weights from the model hub
    model_id = "CompVis/stable-diffusion-v1-1"
    # Requires a CUDA-capable GPU
    device = "cuda"
    pipe = StableDiffusionPipeline.from_pretrained(model_id)
    pipe = pipe.to(device)
    
  3. Generate Images:
    prompt = "a photo of an astronaut riding a horse on mars"
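    # Run inference under mixed precision; recent diffusers releases
    # return the generated PIL images in the .images attribute
    with torch.autocast("cuda"):
        image = pipe(prompt).images[0]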
    image.save("astronaut_rides_horse.png")
    
  4. GPU Requirements: The code above needs a CUDA-capable GPU. A cloud GPU, such as an NVIDIA A100 instance on AWS, is recommended for optimal performance.
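
When an A100-class GPU is not available, one common approach is to load the weights in half precision on CUDA and fall back to full precision on CPU. The following is a minimal sketch, assuming a recent diffusers release in which from_pretrained accepts a torch_dtype argument. The helper names choose_settings and load_pipeline are hypothetical, and the heavy imports are deferred so that the settings helper can be checked without a GPU.

```python
def choose_settings(cuda_available: bool) -> dict:
    """Pick a device and dtype name.

    float16 roughly halves the memory taken by the weights on GPU;
    CPU inference stays in float32.
    """
    if cuda_available:
        return {"device": "cuda", "dtype": "float16"}
    return {"device": "cpu", "dtype": "float32"}


def load_pipeline(model_id: str = "CompVis/stable-diffusion-v1-1"):
    """Load the pipeline with settings suited to the available hardware."""
    # Imported here so this module can be imported without torch/diffusers.
    import torch
    from diffusers import StableDiffusionPipeline

    settings = choose_settings(torch.cuda.is_available())
    pipe = StableDiffusionPipeline.from_pretrained(
        model_id,
        torch_dtype=getattr(torch, settings["dtype"]),
    )
    return pipe.to(settings["device"])
```

With the pipeline returned by load_pipeline(), generation works as in step 3: pipe(prompt).images[0] yields a PIL image that can be saved with image.save(...). CPU inference is possible but considerably slower.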

License

The model is open access under the CreativeML OpenRAIL-M license, which allows commercial use and redistribution with certain restrictions. Users must follow the license's restrictions on generating harmful content and must include a copy of the license with any redistributed version. More details are available on the license page.
