stable-diffusion-v-1-1-original

CompVis

Stable Diffusion V1 Model Card

Introduction

Stable Diffusion is a latent text-to-image diffusion model that generates realistic images from text prompts. It was developed by Robin Rombach and Patrick Esser and is released under the CreativeML OpenRAIL-M license.

Architecture

Stable Diffusion is a latent diffusion model: an autoencoder maps images into a latent space, and the diffusion model is trained in that latent space rather than on pixels. Text prompts are encoded by a CLIP ViT-L/14 text encoder, and the encoder's outputs are fed into the UNet backbone of the diffusion model via cross-attention.
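To make the cross-attention step concrete, here is a minimal single-head sketch in plain Python. It is not the model's implementation: the dimensions are toy values, the projection matrices are random rather than learned, and the helper names (matmul, softmax, cross_attention, random_matrix) are hypothetical. It only illustrates the data flow, in which queries come from the UNet's latent features while keys and values come from the text encoder's token embeddings.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latent_feats, text_embeds, w_q, w_k, w_v):
    """Each latent position attends over all text tokens (single head)."""
    q = matmul(latent_feats, w_q)   # queries from the image latents
    k = matmul(text_embeds, w_k)    # keys from the text embeddings
    v = matmul(text_embeds, w_v)    # values from the text embeddings
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)         # shape: (n_latent, n_text)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Random stand-in for features or learned projection weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes chosen for readability, not the real model's dimensions.
rng = random.Random(0)
N_LATENT, D_LATENT = 16, 8
N_TEXT, D_TEXT = 5, 6
D_HEAD = 4

latent_feats = random_matrix(N_LATENT, D_LATENT, rng)
text_embeds = random_matrix(N_TEXT, D_TEXT, rng)
out, weights = cross_attention(
    latent_feats,
    text_embeds,
    random_matrix(D_LATENT, D_HEAD, rng),
    random_matrix(D_TEXT, D_HEAD, rng),
    random_matrix(D_TEXT, D_HEAD, rng),
)
print(len(out), len(out[0]))  # one D_HEAD-sized vector per latent position
```

Each row of weights is a probability distribution over the prompt's tokens, so every latent position receives a prompt-dependent mixture of the text values.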

Training

Stable Diffusion V1 was trained on the LAION-2B (en) dataset and subsets of it. Training produced a series of checkpoints:

  • sd-v1-1.ckpt: 237,000 steps at 256x256 resolution and 194,000 steps at 512x512 resolution.
  • sd-v1-2.ckpt: Continued from sd-v1-1.ckpt with 515,000 steps at 512x512 resolution.
  • sd-v1-3.ckpt: Continued from sd-v1-2.ckpt with 195,000 steps at 512x512 resolution.
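Because each checkpoint continues from the one before it, the step counts accumulate. The short sketch below adds them up. The step numbers are copied from the list above; the assumption that each checkpoint's total is the running sum of all earlier stages follows from the "Continued from" wording, and the function name is hypothetical.

```python
# Steps per training stage, copied from the checkpoint list above.
STAGES = [
    ("sd-v1-1.ckpt", [237_000, 194_000]),  # 256x256 stage, then 512x512 stage
    ("sd-v1-2.ckpt", [515_000]),
    ("sd-v1-3.ckpt", [195_000]),
]


def cumulative_steps(stages):
    """Return the running total of training steps behind each checkpoint."""
    total = 0
    totals = {}
    for name, steps in stages:
        total += sum(steps)
        totals[name] = total
    return totals


totals = cumulative_steps(STAGES)
for name, steps in totals.items():
    print(f"{name}: {steps:,} cumulative steps")
```

Under that assumption, sd-v1-1.ckpt reflects 431,000 steps, sd-v1-2.ckpt 946,000, and sd-v1-3.ckpt 1,141,000.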

Key training details:

  • Hardware: 32 x 8 x A100 GPUs
  • Optimizer: AdamW
  • Batch size: 2048
  • Learning rate: Warmup to 0.0001 over 10,000 steps, then kept constant
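The numbers above can be turned into a small worked example. This is a sketch under stated assumptions, not the training code: "32 x 8" is read here as 32 nodes with 8 GPUs each, the warmup is assumed to be a linear ramp (the list gives only the target and duration), and the rate is held constant afterward as in the upstream model card. All names are hypothetical.

```python
PEAK_LR = 1e-4          # learning rate reached at the end of warmup
WARMUP_STEPS = 10_000   # warmup duration in optimizer steps
NODES, GPUS_PER_NODE = 32, 8   # assumed reading of "32 x 8 x A100"
GLOBAL_BATCH = 2048


def learning_rate(step):
    """Linear warmup (assumed shape) to PEAK_LR, then constant."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR


total_gpus = NODES * GPUS_PER_NODE
# Samples each GPU contributes per optimizer step, however that is split
# between the per-device batch and gradient accumulation.
samples_per_gpu = GLOBAL_BATCH // total_gpus

print(f"GPUs: {total_gpus}, samples per GPU per step: {samples_per_gpu}")
for step in (0, 5_000, 10_000, 100_000):
    print(f"step {step:>7,}: lr = {learning_rate(step):.6f}")
```

With this reading, 256 GPUs share the batch of 2048, which works out to 8 samples per GPU per optimizer step.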

Guide: Running Locally

  1. Download Weights: Obtain the model weights from the following links:

  2. Codebase: Use the original CompVis Stable Diffusion codebase available on GitHub.

  3. Setup Environment: Ensure you have a compatible environment, preferably with a GPU for efficient processing. Cloud GPU services like AWS, GCP, or Azure are recommended for running the model.

  4. Run Model: Follow instructions in the codebase repository to execute the model using your text inputs.
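As a rough illustration of step 4, the sketch below assembles a command line for the text-to-image sampling script in the CompVis repository. Treat the details as assumptions to check against the repository's README: the script path scripts/txt2img.py and the --prompt, --ckpt, --outdir, and --plms flags are given as recalled from that codebase, the checkpoint location is an example path, and build_txt2img_command is a hypothetical helper. The code only builds and prints the command; it does not run the model.

```python
import shlex


def build_txt2img_command(prompt, ckpt_path, outdir="outputs/txt2img-samples"):
    """Assemble the argument list for the sampling script (not executed here)."""
    return [
        "python", "scripts/txt2img.py",
        "--prompt", prompt,
        "--ckpt", ckpt_path,   # path to the downloaded weights (example path)
        "--outdir", outdir,    # where generated images are written
        "--plms",              # select the PLMS sampler
    ]


cmd = build_txt2img_command(
    "a photograph of an astronaut riding a horse",
    "models/ldm/stable-diffusion-v1/sd-v1-1.ckpt",
)
# Print a shell-safe version to copy into a terminal at the repository root.
print(shlex.join(cmd))
```

Passing the arguments as a list, for example to subprocess.run(cmd, check=True), avoids shell-quoting problems when prompts contain spaces or quotes.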

License

The model is licensed under the CreativeML OpenRAIL-M license. Key conditions include:

  • No intentional production or sharing of illegal or harmful content.
  • Generated outputs may be used freely, but users are accountable for how they use them.
  • The weights may be redistributed and the model used commercially, provided the same license restrictions are passed on.
