CompVis/stable-diffusion-v-1-1-original
Stable Diffusion v1 Model Card
Introduction
Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images from text prompts. It was developed by Robin Rombach and Patrick Esser and is released under the CreativeML OpenRAIL-M license.
Architecture
The model is a diffusion-based text-to-image generator that pairs an autoencoder with a diffusion model trained in the autoencoder's latent space. A frozen CLIP ViT-L/14 text encoder processes the text prompt, and its output embeddings condition the UNet backbone of the diffusion model via cross-attention.
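To make this component layout concrete, the sketch below walks one pass through the three parts: the text encoder produces prompt embeddings, the UNet predicts noise in latent space while attending to those embeddings via cross-attention, and the VAE decoder maps latents back to pixels. It is an illustration only and assumes a diffusers-format copy of the weights (the repository id used here is an assumption); the original .ckpt checkpoints are intended for the CompVis codebase.

```python
# Illustrative sketch only: one pass through the three components described
# above, using the diffusers library. The repository id is an assumption; the
# original sd-v1-*.ckpt files are loaded through the CompVis codebase instead.
import torch
from diffusers import StableDiffusionPipeline

torch.set_grad_enabled(False)
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# 1) The CLIP ViT-L/14 text encoder turns the prompt into token embeddings.
tokens = pipe.tokenizer(
    "a red bicycle",
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    return_tensors="pt",
)
text_emb = pipe.text_encoder(tokens.input_ids)[0]  # shape (1, 77, 768)

# 2) The UNet predicts noise in latent space, attending to the text embeddings
#    via cross-attention (encoder_hidden_states). The latents here are random
#    noise, standing in for one step of the iterative denoising loop.
latents = torch.randn(1, pipe.unet.config.in_channels, 64, 64)
noise_pred = pipe.unet(latents, timestep=999, encoder_hidden_states=text_emb).sample

# 3) The VAE decoder maps (denoised) latents back to pixel space.
#    0.18215 is the latent scaling factor used by Stable Diffusion v1.
image = pipe.vae.decode(latents / 0.18215).sample  # shape (1, 3, 512, 512)
```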
Training
Stable Diffusion v1 was trained on the LAION-2B (en) dataset and subsets of it. Training proceeded in stages, producing the following checkpoints:
- sd-v1-1.ckpt: 237,000 steps at 256x256 resolution and 194,000 steps at 512x512 resolution.
- sd-v1-2.ckpt: Continued from sd-v1-1.ckpt with 515,000 steps at 512x512 resolution.
- sd-v1-3.ckpt: Continued from sd-v1-2.ckpt with 195,000 steps at 512x512 resolution.
Key training details:
- Hardware: 32 x 8 x A100 GPUs
- Optimizer: AdamW
- Batch size: 2048
- Learning rate: warmed up to 0.0001 over 10,000 steps, then kept constant (a schedule sketch follows this list)
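For orientation, the warmup described above amounts to a linear ramp to the peak learning rate over the first 10,000 optimizer steps, after which the rate stays constant. The snippet below is a minimal PyTorch sketch of that schedule, not the actual training configuration; the model and loop are placeholders.

```python
# Minimal sketch of the stated schedule: linear warmup of AdamW's learning rate
# to 1e-4 over 10,000 steps, then constant. Placeholder model and loop only.
import torch

model = torch.nn.Linear(4, 4)                         # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

warmup_steps = 10_000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),
)

for step in range(warmup_steps + 5):                  # loss/backward elided
    optimizer.step()
    scheduler.step()

print(scheduler.get_last_lr())                        # [0.0001] after warmup
```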
Guide: Running Locally
- Download Weights: obtain the desired checkpoint (for example, sd-v1-1.ckpt) from the download links provided with this model card.
- Codebase: use the original CompVis Stable Diffusion codebase available on GitHub.
- Setup Environment: make sure you have a compatible environment, preferably with a GPU for efficient inference; cloud GPU services such as AWS, GCP, or Azure are well suited to running the model.
- Run Model: follow the instructions in the codebase repository to generate images from your text prompts (a hedged example is sketched below).
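The repository's own scripts are the canonical way to run the original .ckpt files. As a non-authoritative alternative, the sketch below generates an image through the Hugging Face diffusers library, assuming a diffusers-format copy of the weights (the repository id below is an assumption) and a CUDA-capable GPU.

```python
# Hedged alternative to the CompVis scripts: end-to-end text-to-image generation
# via diffusers. Assumes a diffusers-format checkpoint and a CUDA-capable GPU;
# the raw sd-v1-*.ckpt files are meant for the CompVis codebase instead.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",   # assumed repository id
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")                  # GPU strongly recommended, as noted above

prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("astronaut.png")
```

Here num_inference_steps controls the length of the denoising loop and guidance_scale controls the strength of classifier-free guidance; the values shown are common defaults, not settings prescribed by this model card.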
License
The model is licensed under the CreativeML OpenRAIL-M license. Key conditions include:
- No intentional production or sharing of illegal or harmful content.
- Generated outputs may be used freely, but users are accountable for how they use them.
- Redistribution of weights and commercial use must adhere to the same license restrictions.