Stable Diffusion v1-4
Introduction
Stable Diffusion v1-4 is a latent text-to-image diffusion model capable of generating photo-realistic images from textual descriptions. It resumes from the earlier v1-2 checkpoint and was fine-tuned at 512x512 resolution to improve sample quality. Developed by Robin Rombach and Patrick Esser, it is intended for research purposes, including the study of the biases and limitations of generative models and the creation of artistic works.
Architecture
Stable Diffusion is a latent diffusion model: rather than denoising in pixel space, it combines an autoencoder with a diffusion model trained in the autoencoder's latent space. Text prompts are processed by a frozen, pretrained CLIP ViT-L/14 text encoder, and the resulting embeddings are fed into the UNet backbone via cross-attention. The model was trained on English captions (subsets of LAION-5B) and therefore performs best with English prompts.
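To see how these components map onto the diffusers library, the pipeline exposes each one as an attribute; a minimal inspection sketch, assuming diffusers is installed and the weights are accessible:

from diffusers import StableDiffusionPipeline

# Loading fetches several GB of weights on first use
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

print(type(pipe.vae).__name__)           # AutoencoderKL: maps images to/from the latent space
print(type(pipe.text_encoder).__name__)  # CLIPTextModel: the CLIP ViT-L/14 text encoder
print(type(pipe.unet).__name__)          # UNet2DConditionModel: denoising UNet conditioned via cross-attention
print(type(pipe.scheduler).__name__)     # the noise scheduler used at sampling time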
Training
The model was trained on subsets of the LAION-2B (en) dataset. During training, images are encoded into latent representations by the autoencoder and text prompts by the CLIP ViT-L/14 text encoder; the UNet then learns to predict the noise added to the latents, with the reconstruction error on that noise as the loss. Training used 32 x 8 A100 GPUs with the AdamW optimizer and a learning rate warmed up to 0.0001 over 10,000 steps and then held constant. The model was produced through a series of checkpoints, each resumed from the previous one and trained further on different dataset subsets.
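As a rough sketch of this objective, not the authors' actual training code, one training step in a diffusers-style setup might look as follows (vae, unet, text_embeddings, and noise_scheduler are assumed stand-ins for the pretrained modules and scheduler):

import torch
import torch.nn.functional as F

def training_step(images, text_embeddings, vae, unet, noise_scheduler):
    # Encode images into latents; 0.18215 is the scaling factor used by the v1 models
    latents = vae.encode(images).latent_dist.sample() * 0.18215
    # Sample Gaussian noise and random timesteps, then noise the latents
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    # The UNet predicts the added noise, conditioned on the text embeddings
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeddings).sample
    # Loss: mean squared error between predicted and actual noise
    return F.mse_loss(noise_pred, noise)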
Guide: Running Locally
- Installation: Install the required libraries:
pip install --upgrade diffusers transformers scipy
- Authentication: Log in to the Hugging Face Hub with:
huggingface-cli login
- Setup: Run the model with the snippet below; a CUDA-capable GPU is strongly recommended (a CPU fallback is sketched after this list):
import torch
from diffusers import StableDiffusionPipeline

model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda"

# The token stored by `huggingface-cli login` is picked up automatically
pipe = StableDiffusionPipeline.from_pretrained(model_id)
pipe = pipe.to(device)

# Fix the seed so results are reproducible
generator = torch.Generator(device=device).manual_seed(0)

prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt, generator=generator).images[0]
image.save("astronaut_rides_horse.png")
- Cloud GPUs: For better performance, consider cloud GPU services such as AWS EC2, Google Cloud, or Azure.
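If no CUDA device is available, the pipeline can also run on CPU, though generation is much slower; a minimal fallback sketch (half precision is assumed to be usable only on the GPU):

import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
# fp16 roughly halves GPU memory use; most CPUs only support fp32
dtype = torch.float16 if device == "cuda" else torch.float32

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=dtype
).to(device)
image = pipe("a photograph of an astronaut riding a horse").images[0]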
License
Stable Diffusion v1-4 is released under the CreativeML OpenRAIL-M license. The license permits commercial use and redistribution of the weights, provided that any redistribution includes the same use restrictions and a copy of the license for downstream users. Key provisions prohibit deliberately producing or sharing illegal or harmful outputs; the authors claim no rights over generated outputs, but users are accountable for how those outputs are used. For the full terms, refer to the license documentation.