Stable Diffusion

Developed by CompVis

Introduction

Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images from text prompts. Several model checkpoints are available; higher version numbers generally reflect longer training and improved image quality.

Architecture

Stable Diffusion is a latent diffusion model: an autoencoder compresses images into a lower-dimensional latent space, a U-Net performs the denoising diffusion process in that space, and a frozen CLIP ViT-L/14 text encoder conditions the U-Net on the text prompt. Successive versions were trained on progressively larger and more carefully filtered LAION datasets.
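As an illustration, the pretrained pipeline in Hugging Face's Diffusers library exposes these components directly. This is a minimal sketch assuming the CompVis/stable-diffusion-v1-4 checkpoint and an installed diffusers package; the comments note which architectural role each sub-module plays.

```python
from diffusers import StableDiffusionPipeline

# Download the full pipeline; its sub-modules mirror the architecture above.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

print(type(pipe.vae).__name__)           # autoencoder: image <-> latent space
print(type(pipe.text_encoder).__name__)  # CLIP text encoder for the prompt
print(type(pipe.unet).__name__)          # U-Net that denoises in latent space
```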

Training

  • Stable-Diffusion-v1-1: Trained for 237,000 steps at 256x256 resolution on laion2B-en, then 194,000 steps at 512x512 resolution on laion-high-resolution.
  • Stable-Diffusion-v1-2: Resumed from v1-1; trained for 515,000 steps at 512x512 resolution on a LAION subset filtered for aesthetics score, image resolution, and low estimated watermark probability.
  • Stable-Diffusion-v1-3: Resumed from v1-2; trained for 195,000 additional steps at 512x512 resolution, dropping the text-conditioning 10% of the time to improve classifier-free guidance sampling (see the sketch after this list).
  • Stable-Diffusion-v1-4: Resumed from v1-2; trained for 225,000 steps at 512x512 resolution on the aesthetics-filtered subset, likewise with 10% text-conditioning dropout.
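Dropping the text-conditioning during training lets the same network produce both conditional and unconditional noise predictions, which classifier-free guidance combines at sampling time. The sketch below shows only that combination step; the function and tensor names are illustrative, and `guidance_scale` follows the Diffusers naming convention (7.5 is a common default).

```python
import torch

def apply_classifier_free_guidance(
    noise_uncond: torch.Tensor,   # U-Net prediction with empty-prompt conditioning
    noise_text: torch.Tensor,     # U-Net prediction with the actual text prompt
    guidance_scale: float = 7.5,  # values > 1 push samples toward the prompt
) -> torch.Tensor:
    # Extrapolate from the unconditional prediction toward the
    # text-conditioned one; larger scales trade diversity for fidelity.
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)
```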

Guide: Running Locally

To run Stable Diffusion locally (an end-to-end sketch follows the steps below):

  1. Setup Environment:
    • Install the necessary Python libraries, including Hugging Face's Diffusers library.
    • Obtain the model checkpoints from the Hugging Face Hub.
  2. Load Model:
    • Use the Diffusers library or the original Stable Diffusion GitHub repository to load a checkpoint.
  3. Execution:
    • Provide a text prompt and run the model to generate images.
  4. Hardware Recommendation:
    • For fast generation, especially at higher resolutions, use a GPU; cloud GPUs such as those from Google Cloud or AWS work well.
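Putting the steps together, here is a minimal end-to-end sketch using the Diffusers library. It assumes the CompVis/stable-diffusion-v1-4 checkpoint, a CUDA-capable GPU, and that the diffusers, transformers, and torch packages are installed; half precision is optional but reduces GPU memory use.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load model (step 2): fetch the checkpoint from the Hugging Face Hub.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,  # half precision to reduce GPU memory use
)
pipe = pipe.to("cuda")

# Execution (step 3): generate an image from a text prompt.
prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")
```

Depending on the checkpoint's access settings, you may first need to authenticate with `huggingface-cli login`.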

License

The model is released under the CreativeML OpenRAIL-M license, which builds on the responsible-AI licensing work of BigScience and the RAIL Initiative. The full license text is available on the Hugging Face website.
