stable-diffusion-2-1-base

stabilityai

Introduction

Stable Diffusion v2-1-base is a text-to-image model developed by Stability AI. It is a Latent Diffusion Model that generates and modifies images from text prompts. The model was fine-tuned from stable-diffusion-2-base (512-base-ema.ckpt) with 220k extra training steps on the same dataset.

Architecture

  • Model Type: Latent Diffusion Model
  • Text Encoder: OpenCLIP-ViT/H
  • Image Encoder: Autoencoder converting images to latent representations
  • Backbone: UNet with cross-attention
  • Languages: Primarily English
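
To make the autoencoder's role concrete, the sketch below computes the shape of the latent that the UNet would operate on for a given image size. It assumes a spatial downsampling factor of 8 and 4 latent channels, the values commonly documented for Stable Diffusion autoencoders; neither number is stated in this summary, and latent_shape is a hypothetical helper introduced only for illustration.

```python
from __future__ import annotations

# Assumed values (not stated in this summary): the autoencoder downsamples
# each spatial dimension by 8 and produces 4 latent channels.
DOWNSAMPLE_FACTOR = 8
LATENT_CHANNELS = 4


def latent_shape(height: int, width: int) -> tuple[int, int, int]:
    """Return (channels, height, width) of the latent for an RGB image."""
    if height % DOWNSAMPLE_FACTOR or width % DOWNSAMPLE_FACTOR:
        raise ValueError("height and width must be multiples of 8")
    return (
        LATENT_CHANNELS,
        height // DOWNSAMPLE_FACTOR,
        width // DOWNSAMPLE_FACTOR,
    )


if __name__ == "__main__":
    channels, h, w = latent_shape(512, 512)
    rgb_values = 3 * 512 * 512
    latent_values = channels * h * w
    print((channels, h, w), f"{rgb_values // latent_values}x fewer values than RGB")
```

Under these assumptions a 512 x 512 image maps to a 4 x 64 x 64 latent, which is why diffusion in latent space is much cheaper than diffusion in pixel space.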

Training

The model was trained on the LAION-5B dataset and its subsets, filtered using LAION's NSFW detector. During training, an autoencoder encodes images into latent representations, and the loss is a reconstruction objective between the noise added to the latent and the prediction made by the UNet. Released checkpoints include:

  • Version 2.1: Fine-tuning with additional steps; includes 512-base-ema.ckpt and 768-v-ema.ckpt.
  • Version 2.0: Initial training with various configurations for different tasks such as depth processing and inpainting.

Training ran on 32 x 8 A100 GPUs with the AdamW optimizer.
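
One common reading of the reconstruction objective mentioned above is a mean squared error between the noise added to a latent and the noise the UNet predicts. The toy sketch below assumes that formulation and uses plain Python lists in place of tensors. The names add_noise and reconstruction_loss, the value of alpha_bar, and the stand-in predictions are all hypothetical; this is not the actual training code.

```python
import math
import random


def add_noise(latent, noise, alpha_bar):
    """Forward diffusion step: blend the clean latent with Gaussian noise."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * x + noise_scale * n for x, n in zip(latent, noise)]


def reconstruction_loss(noise, predicted_noise):
    """Mean squared error between the added noise and the predicted noise."""
    squared_errors = [(n - p) ** 2 for n, p in zip(noise, predicted_noise)]
    return sum(squared_errors) / len(squared_errors)


rng = random.Random(0)
latent = [rng.gauss(0.0, 1.0) for _ in range(16)]  # stand-in for a flattened latent
noise = [rng.gauss(0.0, 1.0) for _ in range(16)]   # noise added during training
alpha_bar = 0.5                                    # hypothetical noise-schedule value

noisy_latent = add_noise(latent, noise, alpha_bar)

# A real UNet would predict the noise from the noisy latent, the timestep,
# and the text embedding. Two stand-in predictions bracket the loss instead.
perfect_prediction = list(noise)
zero_prediction = [0.0] * len(noise)

print(reconstruction_loss(noise, perfect_prediction))  # prints 0.0
print(reconstruction_loss(noise, zero_prediction))     # positive: mean of noise**2
```

A perfect prediction drives the loss to zero, while a prediction that ignores the noise leaves a loss equal to the mean squared noise value.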

Guide: Running Locally

  1. Installation:

    pip install diffusers transformers accelerate scipy safetensors
    
  2. Running the Model:

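    from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler
    import torch
    
    model_id = "stabilityai/stable-diffusion-2-1-base"
    # Load the scheduler configuration shipped with the model (Euler sampling).
    scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")
    # Load the weights in half precision to reduce GPU memory use.
    pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, torch_dtype=torch.float16)
    # Move the pipeline to the GPU; this requires a CUDA-capable device.
    pipe = pipe.to("cuda")
    
    prompt = "a photo of an astronaut riding a horse on mars"
    # The pipeline returns a list of PIL images; take the first one.
    image = pipe(prompt).images[0]
    image.save("astronaut_rides_horse.png")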
    
  3. Recommendations:

    • If no local GPU is available, use a cloud GPU from a provider such as AWS, GCP, or Azure.
    • Install xformers for memory-efficient attention.
    • If GPU RAM is limited, call pipe.enable_attention_slicing() after moving the pipeline to "cuda"; this lowers memory use at some cost in speed.
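
The memory recommendations above can be sketched as a small helper. The function apply_memory_options and its flags are hypothetical, not part of diffusers; the two methods it calls, enable_xformers_memory_efficient_attention() and enable_attention_slicing(), are existing pipeline methods. The sketch assumes pipe is a pipeline that has already been loaded and moved to the GPU.

```python
def apply_memory_options(pipe, use_xformers=False, low_vram=False):
    """Apply optional memory savers to a loaded pipeline (hypothetical helper).

    Returns the names of the options that were actually enabled.
    """
    applied = []
    if use_xformers:
        try:
            # Requires the xformers package to be installed.
            pipe.enable_xformers_memory_efficient_attention()
            applied.append("xformers")
        except Exception as exc:  # e.g. xformers missing or unsupported
            print(f"xformers not enabled: {exc}")
    if low_vram:
        # Computes attention in slices: lower peak memory, somewhat slower.
        pipe.enable_attention_slicing()
        applied.append("attention_slicing")
    return applied
```

With the pipeline from step 2, a call such as apply_memory_options(pipe, low_vram=True) would go right after pipe.to("cuda") and before generating images.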

License

The model is licensed under the CreativeML Open RAIL++-M License. This license allows for specific use cases and restricts misuse, particularly in generating harmful content or violating terms of use for copyrighted materials.
