Stable Diffusion 2

stabilityai

Introduction

Stable Diffusion v2, developed by Stability AI, is a diffusion-based text-to-image model that generates and modifies images from text prompts. It is a Latent Diffusion Model that uses a fixed, pretrained text encoder (OpenCLIP-ViT/H).

Architecture

The model combines an autoencoder with a diffusion model that operates in the autoencoder's latent space. The encoder maps images to latent representations, and text prompts are encoded by OpenCLIP-ViT/H. The text encoder's output is fed into the UNet backbone through cross-attention, and the UNet carries out the denoising steps of the diffusion process.
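To make the latent-space idea concrete, the sketch below computes the shape of the latent for a given image size. It assumes the downsampling factor of 8 and the 4 latent channels reported in the Stable Diffusion v2 model card. The function name `latent_shape` is hypothetical and not part of any library.

```python
def latent_shape(height, width, factor=8, latent_channels=4):
    """Return (channels, height, width) of the latent for an RGB image.

    Assumes an autoencoder that downsamples each spatial side by `factor`
    and emits `latent_channels` channels (values taken from the model card).
    """
    if height % factor or width % factor:
        raise ValueError("height and width must be multiples of the downsampling factor")
    return (latent_channels, height // factor, width // factor)


# A 768x768 image maps to a 4x96x96 latent under these assumptions.
print(latent_shape(768, 768))  # (4, 96, 96)
```

Under these assumptions the UNet works on 48 times fewer values than the pixel image (768 x 768 x 3 versus 4 x 96 x 96), which is the main reason for running diffusion in latent space.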

Training

Stable Diffusion v2 was trained on LAION-5B and subsets of it. During training, images are encoded into latents, text prompts are encoded by the text encoder, and the UNet is optimized with a reconstruction objective between the noise added to the latent and the UNet's prediction of that noise. Several checkpoints were trained, differing in resolution and in additional conditioning inputs, on A100 GPUs with the AdamW optimizer.
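The reconstruction objective can be illustrated with a toy example. The sketch below is not the training code: it uses plain Python lists in place of tensors, omits the UNet, the timestep schedule, and the text conditioning, and shows only the noise-prediction form of the loss (the model card also mentions a v-objective, which is not shown). The names `add_noise`, `noise_prediction_loss`, and `alpha_bar` are illustrative assumptions.

```python
import math
import random


def add_noise(latent, noise, alpha_bar):
    """Forward diffusion step: blend a clean latent with Gaussian noise.

    alpha_bar in (0, 1] is the fraction of signal variance kept at the sampled timestep.
    """
    signal = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [signal * z + sigma * e for z, e in zip(latent, noise)]


def noise_prediction_loss(noise, predicted_noise):
    """Mean squared error between the added noise and the model's prediction."""
    return sum((e - p) ** 2 for e, p in zip(noise, predicted_noise)) / len(noise)


rng = random.Random(0)
latent = [rng.gauss(0.0, 1.0) for _ in range(16)]  # stand-in for an encoded image
noise = [rng.gauss(0.0, 1.0) for _ in range(16)]   # noise the model should recover
noisy_latent = add_noise(latent, noise, alpha_bar=0.5)  # what the model would see

# A perfect predictor recovers the noise exactly, so the loss is zero.
print(noise_prediction_loss(noise, noise))  # 0.0
# A predictor that always outputs zeros pays the mean squared noise magnitude.
print(noise_prediction_loss(noise, [0.0] * 16))
```

In real training the prediction comes from the UNet given the noisy latent, the timestep, and the text embedding, and the optimizer updates the UNet to reduce this loss.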

Guide: Running Locally

  1. Install Dependencies

    • Install required libraries:
      pip install diffusers transformers accelerate scipy safetensors
      
  2. Run the Pipeline

    • Example code:
      import torch
      from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler
      
      model_id = "stabilityai/stable-diffusion-2"
      # Load the Euler scheduler configuration shipped with the model
      scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")
      # Load the pipeline in half precision to reduce GPU memory use
      pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, torch_dtype=torch.float16)
      pipe = pipe.to("cuda")  # requires a CUDA-capable GPU
      
      prompt = "a photo of an astronaut riding a horse on mars"
      # Generate one image and save it to disk
      image = pipe(prompt).images[0]
      image.save("astronaut_rides_horse.png")
      
  3. Recommendations

    • Install xformers for memory-efficient attention.
    • Use pipe.enable_attention_slicing() for reduced VRAM usage on low-memory GPUs.

  4. Cloud GPUs

    • Consider using cloud services like AWS or Google Cloud for access to powerful GPUs.
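The memory recommendations in step 3 can be combined into one small helper. The sketch below is a possible arrangement, not an official recipe: it tries xformers first and falls back to attention slicing if that fails. The helper name `apply_memory_savers` is hypothetical; the two `enable_*` methods are the ones diffusers pipelines provide.

```python
def apply_memory_savers(pipe):
    """Enable one memory-saving attention mode on a diffusers pipeline.

    Returns the name of the mode that was enabled.
    """
    try:
        # Only succeeds when the optional xformers package is installed and usable.
        pipe.enable_xformers_memory_efficient_attention()
        return "xformers"
    except Exception:
        # Fall back to sliced attention, which lowers VRAM use at some cost in speed.
        pipe.enable_attention_slicing()
        return "attention_slicing"
```

It would be called once after `pipe = pipe.to("cuda")` and before generating images, for example `mode = apply_memory_savers(pipe)`.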

License

The model is licensed under the CreativeML Open RAIL++-M License, which includes provisions for responsible use and limitations on misuse and harmful applications.
