Stable Diffusion v2-1

stabilityai

Introduction

Stable Diffusion v2-1 is a diffusion-based text-to-image generation model developed by Robin Rombach and Patrick Esser. It generates and modifies images from text prompts, using a pretrained text encoder, OpenCLIP-ViT/H. The model is released under the CreativeML Open RAIL++-M License and is available through the Stability AI GitHub repository.

Architecture

Stable Diffusion v2-1 is a latent diffusion model: an autoencoder compresses images into a latent space, and a diffusion model is trained in that space. Text prompts are encoded by the OpenCLIP-ViT/H text encoder, and the encoder output is fed into the UNet backbone through cross-attention. Training uses a reconstruction objective between the noise added to the latent and the noise predicted by the UNet. Specialized checkpoints add image inpainting and upscaling.
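As a rough illustration of the two ideas above (working in a smaller latent space, and training the UNet to predict the added noise), here is a minimal pure-Python sketch. The downsampling factor of 8 and the 4 latent channels are taken from the upstream model card rather than from this summary, so treat them as assumptions. The function names (latent_shape, add_noise, noise_prediction_loss) and the toy alpha_bar value are hypothetical and chosen only for this example; the real model operates on tensors and uses a learned UNet.

```python
import math
import random


def latent_shape(height, width, factor=8, channels=4):
    """Shape (C, H/f, W/f) of the latent for an H x W RGB image.

    factor=8 and channels=4 are assumptions based on the upstream model card.
    """
    if height % factor or width % factor:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (channels, height // factor, width // factor)


def add_noise(latent, noise, alpha_bar):
    """Forward diffusion: blend the clean latent with Gaussian noise."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * x + noise_scale * e for x, e in zip(latent, noise)]


def noise_prediction_loss(noise, predicted_noise):
    """Mean squared error between the added noise and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(noise, predicted_noise)) / len(noise)


rng = random.Random(0)
latent = [rng.gauss(0.0, 1.0) for _ in range(16)]  # flattened toy latent
noise = [rng.gauss(0.0, 1.0) for _ in range(16)]   # noise the UNet should recover
noisy = add_noise(latent, noise, alpha_bar=0.5)    # what the UNet would see

print(latent_shape(512, 512))
print(noise_prediction_loss(noise, noise))  # a perfect prediction gives zero loss
```

In the real pipeline the predicted noise comes from the UNet, conditioned on the noisy latent, the timestep, and the text embedding passed in through cross-attention.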

Training

The model was trained on subsets of the LAION-5B dataset, filtered to remove explicit content. Training proceeded in several stages:

  • Base training at 256x256 resolution, the first phase of the 512-base-ema.ckpt checkpoint
  • Further training at 512x512 resolution
  • Fine-tuning of specialized checkpoints with additional conditioning, such as depth prediction and image inpainting

Training used 32 x 8 A100 GPUs, the AdamW optimizer, and a learning rate with warmup steps.
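To make "a learning rate with warmup steps" concrete, the sketch below ramps the rate up and then holds it constant. The peak rate of 1e-4 and the 10,000 warmup steps are taken from the upstream model card, not from this summary, and the linear shape of the ramp is an assumption. The function name warmup_lr is hypothetical.

```python
def warmup_lr(step, peak_lr=1e-4, warmup_steps=10_000):
    """Learning rate at a given optimizer step.

    Ramps linearly from 0 to peak_lr over warmup_steps, then stays constant.
    The default values and the linear ramp are assumptions for illustration.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps


# Print the schedule at a few points before and after the warmup ends.
for step in (0, 2_500, 5_000, 10_000, 50_000):
    print(step, warmup_lr(step))
```

Warmup keeps early updates small while the optimizer's moment estimates are still unreliable, which tends to matter with adaptive optimizers such as AdamW.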

Guide: Running Locally

To run Stable Diffusion v2-1 locally using Hugging Face's Diffusers library, follow these steps:

  1. Install Dependencies:

    pip install diffusers transformers accelerate scipy safetensors
    
  2. Load and Run the Model:

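    import torch
    from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
    
    model_id = "stabilityai/stable-diffusion-2-1"
    
    # Load the weights in half precision to reduce GPU memory use.
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    # Replace the default scheduler with DPMSolverMultistepScheduler, reusing its config.
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    # Move the pipeline to the GPU.
    pipe = pipe.to("cuda")
    
    prompt = "a photo of an astronaut riding a horse on mars"
    # The pipeline returns a list of PIL images; keep the first one.
    image = pipe(prompt).images[0]
    image.save("astronaut_rides_horse.png")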
    
  3. Performance Tips:

    • Install xformers for memory-efficient attention.
    • If GPU memory is limited, call pipe.enable_attention_slicing() after moving the pipeline to the GPU. This reduces VRAM usage at some cost in speed.

If no suitable local GPU is available, consider a cloud GPU, such as an AWS EC2 instance with A100 GPUs.
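The performance tips above can be combined in a small helper that prefers xformers attention and falls back to attention slicing. This is a sketch, not the library's recommended recipe. The helper name apply_memory_optimizations is hypothetical, and it assumes that the pipeline's enable_xformers_memory_efficient_attention() method raises an exception when xformers is missing or unsupported on the current device.

```python
def apply_memory_optimizations(pipe, use_xformers=True):
    """Enable one memory saver on a Diffusers pipeline and report which one.

    Tries xformers memory-efficient attention first; if that fails (for
    example because xformers is not installed), falls back to attention slicing.
    """
    applied = []
    if use_xformers:
        try:
            pipe.enable_xformers_memory_efficient_attention()
            applied.append("xformers")
        except Exception:
            # Assumption: any failure here means xformers cannot be used.
            pass
    if not applied:
        # Slower than xformers attention, but lowers peak VRAM usage.
        pipe.enable_attention_slicing()
        applied.append("attention_slicing")
    return applied


# Usage with the pipeline from step 2, after pipe.to("cuda"):
#   applied = apply_memory_optimizations(pipe)
#   print("enabled:", applied)
```

Catching a broad Exception keeps the sketch short; in real code it would be safer to catch only the import and device errors you expect.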

License

The model is distributed under the CreativeML Open RAIL++-M License. The license allows for use in research and creative applications, with restrictions on misuse, including generating harmful, misleading, or offensive content. For more detailed terms, refer to the license documentation.
