stable-diffusion-v1-5

Introduction

Stable Diffusion v1-5 is a latent text-to-image diffusion model capable of generating photo-realistic images from text prompts. This repository is a mirror of the deprecated runwayml/stable-diffusion-v1-5 repository and is not affiliated with RunwayML. The model is intended to be used with the Diffusers library.

Architecture

Stable Diffusion v1-5 is a latent diffusion model: it combines an autoencoder with a diffusion model that is trained in the autoencoder's latent space. Text prompts are encoded by a fixed, pretrained CLIP ViT-L/14 text encoder, and the resulting embeddings are fed into the UNet backbone of the diffusion model via cross-attention. The model was fine-tuned at a resolution of 512x512, and the training loss is a reconstruction objective between the noise added to the latents and the noise predicted by the UNet.
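
For concreteness, here is a minimal sketch of how these components surface in the Diffusers pipeline. The repository id shown is an assumption based on the Hugging Face mirror of the deprecated runwayml repo; adjust it to wherever your copy lives.

```python
from diffusers import StableDiffusionPipeline

# Assumed Hugging Face mirror id for the deprecated runwayml repository.
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5"
)

print(type(pipe.text_encoder).__name__)  # CLIPTextModel: frozen CLIP ViT-L/14 text encoder
print(type(pipe.unet).__name__)          # UNet2DConditionModel: denoising UNet with cross-attention
print(type(pipe.vae).__name__)           # AutoencoderKL: maps images to/from the latent space
```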

Training

The model was trained on the LAION-2B (en) dataset and subsets of it. Stable Diffusion v1-5 was initialized from the v1-2 checkpoint and fine-tuned for 595,000 steps at a resolution of 512x512 on the "laion-aesthetics v2 5+" dataset, with 10% dropping of the text-conditioning to improve classifier-free guidance sampling. Training used 32 x 8 x A100 GPUs with the AdamW optimizer, a learning-rate warmup, and gradient accumulation.
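
The reconstruction objective mentioned above can be sketched as follows. This is an illustrative reconstruction of the noise-prediction loss using Diffusers conventions, not the actual training code; the function name diffusion_loss and all shapes are schematic.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(unet, scheduler, latents, text_embeddings):
    """Schematic noise-prediction loss; names and shapes are illustrative."""
    noise = torch.randn_like(latents)  # noise the UNet must reconstruct
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    # Forward diffusion: corrupt the latents with noise at the sampled timesteps.
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)
    # The UNet predicts the added noise, conditioned on the text via cross-attention.
    noise_pred = unet(
        noisy_latents, timesteps, encoder_hidden_states=text_embeddings
    ).sample
    # Reconstruction objective: MSE between added and predicted noise.
    return F.mse_loss(noise_pred, noise)
```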

Guide: Running Locally

  1. Setup Environment: Install the required libraries, such as diffusers, transformers, and torch.
  2. Download Weights: Choose v1-5-pruned-emaonly.safetensors (EMA weights only, lighter and suited to inference) or v1-5-pruned.safetensors (EMA and non-EMA weights, suited to fine-tuning).
  3. Load Model: Use the StableDiffusionPipeline class from the Diffusers library to load the model.
  4. Run Inference: Provide a text prompt to generate an image.
  5. Save Output: Save the generated image locally. A complete example covering these steps follows this list.
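
Putting the steps together, here is a minimal end-to-end sketch using the Diffusers API. The model id assumes the Hugging Face mirror, and the prompt and output filename are placeholders.

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed Hugging Face mirror id; prompt and filename below are placeholders.
model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"

pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")  # drop this line (and use torch.float32) to run on CPU

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]

image.save("astronaut_rides_horse.png")
```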

If no local GPU is available, consider renting cloud GPUs from providers such as AWS or Google Cloud for faster generation.

License

The model is licensed under the CreativeML OpenRAIL-M license, an Open RAIL-M license geared toward responsible AI use, building on work by BigScience and the RAIL Initiative. For full details, refer to the Stable Diffusion license.
