stable diffusion 2 1
stabilityai
Introduction
Stable Diffusion v2-1 is a diffusion-based text-to-image generation model developed by Robin Rombach and Patrick Esser. It generates and modifies images from text prompts, using a pretrained text encoder (OpenCLIP-ViT/H). The model is released under the CreativeML Open RAIL++-M License and is available through the Stability AI GitHub repository.
Architecture
Stable Diffusion v2-1 is a latent diffusion model: an autoencoder maps images to and from a compressed latent space, and a diffusion model is trained in that latent space. Text prompts are encoded by the OpenCLIP-ViT/H text encoder, and the encoder output is fed into the UNet backbone through cross-attention. Training uses a reconstruction objective in which the UNet predicts the noise that was added to the latent. Specialized checkpoints extend the architecture to tasks such as image inpainting and upscaling.
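The noise-prediction objective described above can be illustrated with a small, framework-free sketch. It is not the model's training code: the function names, the 16-element toy latent, and the `alpha_bar` parameter (the cumulative signal fraction in standard diffusion notation) are assumptions introduced here for illustration, and no real UNet is involved.

```python
import math
import random


def add_noise(latent, noise, alpha_bar):
    """Forward diffusion step: blend a clean latent with Gaussian noise.

    alpha_bar in (0, 1] is the fraction of signal variance kept at the
    chosen timestep (an assumed, illustrative parameter).
    """
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * z + noise_scale * e for z, e in zip(latent, noise)]


def noise_prediction_loss(predicted_noise, true_noise):
    """Mean squared error between the predicted and the actually added noise."""
    squared = [(p - t) ** 2 for p, t in zip(predicted_noise, true_noise)]
    return sum(squared) / len(squared)


rng = random.Random(0)
latent = [rng.gauss(0.0, 1.0) for _ in range(16)]  # stand-in for an autoencoder latent
noise = [rng.gauss(0.0, 1.0) for _ in range(16)]   # the noise the denoiser must predict
noisy_latent = add_noise(latent, noise, alpha_bar=0.5)

# A real UNet would map (noisy_latent, timestep, text embedding) to a noise
# estimate. A hypothetical all-zero guess stands in for an untrained model.
untrained_guess = [0.0] * len(noise)
print(noise_prediction_loss(untrained_guess, noise))  # greater than zero
print(noise_prediction_loss(noise, noise))            # 0.0 for a perfect prediction
```

In training, the loss would be averaged over many images, prompts, and timesteps, and minimized with respect to the UNet weights.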
Training
The model was trained using subsets of the LAION-5B dataset, filtered to remove explicit content. Training involved several stages:
- Initial training at 256x256 resolution, the first stage of the 512-base-ema.ckpt checkpoint
- Further training at 512x512 resolution
- Fine-tuning of specialized checkpoints with additional conditioning inputs, for tasks such as depth-conditioned generation and image inpainting
- Hardware and optimization: 32 x 8 A100 GPUs (256 in total), the AdamW optimizer, and a learning rate schedule with warmup steps
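The learning rate warmup mentioned in the list above can be sketched as a simple schedule function. The linear ramp, the peak value of `1e-4`, and the `10_000` warmup steps are illustrative assumptions; the summary above only states that warmup steps were used.

```python
def warmup_then_constant_lr(step, peak_lr=1e-4, warmup_steps=10_000):
    """Linear warmup from 0 to peak_lr, then hold peak_lr constant.

    peak_lr and warmup_steps are assumed, illustrative defaults.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps


# Print the learning rate at a few points before and after warmup ends.
for step in (0, 2_500, 10_000, 50_000):
    print(step, warmup_then_constant_lr(step))
```

Ramping the learning rate up gradually is a common way to avoid unstable updates early in training, when optimizer statistics such as AdamW's moment estimates are still poorly calibrated.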
Guide: Running Locally
To run Stable Diffusion v2-1 locally using Hugging Face's Diffusers library, follow these steps:
- Install Dependencies:
    pip install diffusers transformers accelerate scipy safetensors
- Load and Run the Model:
    import torch
    from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
    # Load the pretrained weights in half precision to reduce GPU memory use.
    model_id = "stabilityai/stable-diffusion-2-1"
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    # Swap the default scheduler for DPMSolverMultistepScheduler.
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    pipe = pipe.to("cuda")
    # Generate one image from a text prompt and save it to disk.
    prompt = "a photo of an astronaut riding a horse on mars"
    image = pipe(prompt).images[0]
    image.save("astronaut_rides_horse.png")
- Performance Tips:
  - Install xformers for memory-efficient attention.
  - Use pipe.enable_attention_slicing() to reduce VRAM usage.
For optimal performance, consider using cloud GPUs such as AWS EC2 instances with A100 GPUs.
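The performance tips can be combined into one loading helper, sketched below. The helper names `pick_device_and_dtype` and `load_pipeline`, the CPU fallback, and the choice to ignore a missing xformers install are assumptions made for illustration, not part of the original guide.

```python
def pick_device_and_dtype(cuda_available):
    """Return (device, dtype name): half precision on GPU, full precision on CPU."""
    return ("cuda", "float16") if cuda_available else ("cpu", "float32")


def load_pipeline(model_id="stabilityai/stable-diffusion-2-1"):
    """Load the pipeline and apply the memory-saving options where possible."""
    # Imported lazily so the helper above works without the heavy dependencies.
    import torch
    from diffusers import StableDiffusionPipeline

    device, dtype_name = pick_device_and_dtype(torch.cuda.is_available())
    pipe = StableDiffusionPipeline.from_pretrained(
        model_id, torch_dtype=getattr(torch, dtype_name)
    )
    pipe = pipe.to(device)

    # Attention slicing trades some speed for lower peak VRAM usage.
    pipe.enable_attention_slicing()

    # xformers is optional: keep the default attention if enabling it fails.
    try:
        pipe.enable_xformers_memory_efficient_attention()
    except Exception:
        pass
    return pipe
```

With the dependencies installed, `pipe = load_pipeline()` followed by `pipe("a photo of an astronaut riding a horse on mars").images[0]` should behave like the script above. Generation on CPU is possible in this sketch but much slower than on a GPU.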
License
The model is distributed under the CreativeML Open RAIL++-M License. The license allows for use in research and creative applications, with restrictions on misuse, including generating harmful, misleading, or offensive content. For more detailed terms, refer to the license documentation.