Stable Diffusion v1-4

CompVis

Introduction

Stable Diffusion v1-4 is a text-to-image diffusion model developed by Robin Rombach and Patrick Esser that generates photo-realistic images from text prompts. It runs the diffusion process in a latent space to create and modify images, and conditions it on the output of a pre-trained text encoder (CLIP ViT-L/14). The model is intended for research purposes, such as studying the safe deployment of generative models, probing their limitations and biases, and generating artworks.

Architecture

Stable Diffusion is a diffusion-based text-to-image generation model. It is a latent diffusion model: an autoencoder is combined with a diffusion model that is trained in the autoencoder's latent space. Images are encoded into latent representations, and text prompts are processed by a CLIP ViT-L/14 text encoder whose output is fed into the UNet backbone via cross-attention.
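To make the latent-space idea concrete, the sketch below computes the size of the latent tensor for a given image size. The downsampling factor of 8 and the 4 latent channels come from the upstream model card rather than from the text above, so treat them as assumptions; `latent_shape` is a hypothetical helper name, not part of any library.

```python
def latent_shape(height, width, factor=8, latent_channels=4):
    """Return (channels, height, width) of the latent for an RGB image.

    Assumes the autoencoder downsamples each spatial dimension by `factor`
    and emits `latent_channels` channels (values assumed from the upstream
    model card).
    """
    if height % factor or width % factor:
        raise ValueError("height and width must be divisible by the downsampling factor")
    return (latent_channels, height // factor, width // factor)


shape = latent_shape(512, 512)
pixel_values = 3 * 512 * 512                      # values in the RGB image
latent_values = shape[0] * shape[1] * shape[2]    # values in the latent

print(shape)                          # (4, 64, 64)
print(pixel_values / latent_values)   # 48.0
```

Under these assumptions the UNet operates on a tensor with 48 times fewer values than the pixel image, which is the main reason latent diffusion is cheaper than diffusion in pixel space.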

Training

The model was trained on subsets of the LAION-2B (en) dataset, which consists of images with English captions. The v1 series comprises four checkpoints:

  • v1-1: 237,000 steps at 256x256, followed by 194,000 steps at 512x512 on high-resolution images.
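  • v1-2: resumed from v1-1; 515,000 steps at 512x512 on "laion-improved-aesthetics".
  • v1-3: resumed from v1-2; 195,000 steps at 512x512 on "laion-improved-aesthetics", with 10% dropping of the text-conditioning.
  • v1-4: resumed from v1-2; 225,000 steps at 512x512 on "laion-aesthetics v2 5+", with 10% dropping of the text-conditioning to improve classifier-free guidance sampling.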

Training used 32 x 8 A100 GPUs (256 in total), with a batch size of 2048 and a learning rate that was warmed up to 0.0001 and then held constant.
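As a rough sanity check on scale, the figures above imply the totals computed below. This is a sketch under two assumptions: that "32 x 8" means 32 machines with 8 GPUs each, and that every listed step used the full batch size of 2048, which the text does not state for each checkpoint individually. The dictionary keys are only labels for the stages listed above.

```python
BATCH_SIZE = 2048
NUM_GPUS = 32 * 8  # assumed: 32 machines x 8 A100s each

# Step counts as listed for each training stage
steps = {
    "v1-1 (256x256)": 237_000,
    "v1-1 (512x512)": 194_000,
    "v1-2": 515_000,
    "v1-3": 195_000,
    "v1-4": 225_000,
}

# Samples processed per stage, assuming a constant batch size
samples_seen = {name: n * BATCH_SIZE for name, n in steps.items()}

# Samples each GPU contributes to one optimizer step
# (possibly split across gradient-accumulation micro-batches)
per_gpu_per_step = BATCH_SIZE // NUM_GPUS

print(NUM_GPUS)              # 256
print(per_gpu_per_step)      # 8
print(samples_seen["v1-4"])  # 460800000
```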

Guide: Running Locally

  1. Installation: Ensure Python is installed, then install the required libraries:
    pip install --upgrade diffusers transformers scipy
    
  2. Set Up a GPU: Use a CUDA-capable NVIDIA GPU. For cloud options, consider providers such as AWS or Google Cloud, which offer GPUs like the NVIDIA A100.
  3. Running the Model:
    import torch
    from diffusers import StableDiffusionPipeline
    
    model_id = "CompVis/stable-diffusion-v1-4"
    device = "cuda"
    
    # Load the weights in half precision to reduce GPU memory use
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to(device)
    
    # Generate one image from the prompt and save it to disk
    prompt = "a photo of an astronaut riding a horse on mars"
    image = pipe(prompt).images[0]
    image.save("astronaut_rides_horse.png")
    
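  4. Memory Considerations: If GPU memory is limited, keep the float16 loading shown in step 3 and enable attention slicing by calling pipe.enable_attention_slicing() before generating. This lowers peak memory use at the cost of some speed.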
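The memory advice in step 4 can be sketched as a small helper. `load_low_memory_pipeline` is a hypothetical name chosen for illustration; `enable_attention_slicing()` is the diffusers pipeline method that step 4 refers to. The imports sit inside the function so that defining it does not require a GPU or the libraries to be importable yet.

```python
def load_low_memory_pipeline(model_id="CompVis/stable-diffusion-v1-4", device="cuda"):
    """Load the pipeline in float16 with attention slicing enabled.

    Hypothetical helper: calling it downloads the model weights and
    requires torch, diffusers, and a CUDA-capable GPU.
    """
    import torch
    from diffusers import StableDiffusionPipeline

    # Half-precision weights take roughly half the memory of float32
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to(device)

    # Compute attention in slices instead of all at once to cut peak memory
    pipe.enable_attention_slicing()
    return pipe
```

Calling `pipe = load_low_memory_pipeline()` and then `pipe(prompt).images[0]` as in step 3 should give the same kind of result, typically somewhat more slowly.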

License

The model is released under the CreativeML OpenRAIL-M license, which permits commercial use and redistribution subject to use-based restrictions. Users must not use the model to produce illegal or harmful content, and anyone redistributing the weights must pass on the same use restrictions and include a copy of the license. The full terms are in the license text published with the model.
