stable-diffusion-v-1-4-original
CompVis
Introduction
Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images from text prompts. Developed by Robin Rombach and Patrick Esser, the model conditions image generation on prompts encoded by a fixed, pretrained text encoder (CLIP ViT-L/14).
Architecture
The model combines an autoencoder, which encodes images into lower-dimensional latent representations, with a U-Net backbone that incorporates text encodings via cross-attention. The diffusion process is trained in the autoencoder's latent space rather than in pixel space, which reduces the compute cost of high-resolution synthesis.
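To give a sense of how much smaller the latent space is, the sketch below computes the latent shape for a given image size. It assumes a spatial downsampling factor of 8 and 4 latent channels, figures taken from the upstream model card rather than from this summary; the function names are illustrative only.

```python
def latent_shape(height, width, factor=8, channels=4):
    """Return the (channels, height, width) of the latent for an RGB image.

    Assumption: the autoencoder downsamples each spatial side by `factor`
    and emits `channels` latent channels (8 and 4 per the upstream card).
    """
    if height % factor or width % factor:
        raise ValueError("image sides must be multiples of the downsampling factor")
    return (channels, height // factor, width // factor)


def compression_ratio(height, width, factor=8, channels=4):
    """Ratio of RGB pixel values to latent values for one image."""
    c, h, w = latent_shape(height, width, factor, channels)
    return (3 * height * width) / (c * h * w)
```

Under these assumptions, a 512x512 RGB image maps to a 4x64x64 latent, so the U-Net operates on 48 times fewer values than it would in pixel space.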
Training
Stable Diffusion was trained on the LAION-2B(en) dataset and subsets of it. Training proceeded through several checkpoint phases, each resuming from the previous one, first at 256x256 and then at 512x512 resolution. Training ran on 32 x 8 A100 GPUs with the AdamW optimizer. Checkpoints from the different training stages are available.
Guide: Running Locally
- Download Weights: Obtain the model weights from the Hugging Face repository.
- Environment Setup: Clone the original CompVis Stable Diffusion codebase and set up the environment according to the README instructions.
- Inference: Use a script to load the model and generate images from text prompts. Ensure you have an appropriate GPU setup; cloud GPU instances, such as A100 instances on AWS or Google Cloud, are recommended for optimal performance.
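The steps above could be scripted roughly as follows. This is a minimal sketch, not the official workflow: it assumes the `huggingface_hub` package is installed, that you have accepted the license and logged in to Hugging Face, that the checkpoint is published as `sd-v1-4.ckpt`, and that the CompVis repository is cloned into a local `stable-diffusion` directory with its environment active. The helper names (`download_weights`, `build_txt2img_command`, `generate`) are hypothetical; the flags are those of the `scripts/txt2img.py` script in the CompVis codebase.

```python
import subprocess

# Assumptions: repository id and checkpoint filename on the Hugging Face Hub.
REPO_ID = "CompVis/stable-diffusion-v-1-4-original"
CKPT_FILENAME = "sd-v1-4.ckpt"


def download_weights():
    """Download the checkpoint into the local Hub cache and return its path."""
    # Imported lazily so the rest of the sketch works without huggingface_hub.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=REPO_ID, filename=CKPT_FILENAME)


def build_txt2img_command(prompt, ckpt_path, outdir="outputs/txt2img-samples",
                          n_samples=1, seed=42):
    """Build the argument list for the CompVis text-to-image script."""
    return [
        "python", "scripts/txt2img.py",
        "--prompt", prompt,
        "--ckpt", str(ckpt_path),
        "--outdir", outdir,
        "--n_samples", str(n_samples),
        "--seed", str(seed),
        "--plms",  # use the PLMS sampler
    ]


def generate(prompt, ckpt_path, repo_dir="stable-diffusion"):
    """Run the script from inside the cloned repository (hypothetical path)."""
    subprocess.run(build_txt2img_command(prompt, ckpt_path),
                   cwd=repo_dir, check=True)
```

With those assumptions in place, `generate("a photograph of an astronaut riding a horse", download_weights())` would write samples to the output directory inside the cloned repository. Building the command as a list rather than a shell string avoids quoting problems in prompts.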
License
The model is released under the CreativeML OpenRAIL-M license, which allows open access and commercial use, with specific restrictions on generating harmful or illegal content. Users must agree to the license terms, which are intended to ensure responsible use and distribution of the model. For the complete terms, refer to the full text of the CreativeML OpenRAIL-M license.