Stable Diffusion 3.5 Medium
Introduction
Stable Diffusion 3.5 Medium is a text-to-image generation model developed by Stability AI. It is an advanced Multimodal Diffusion Transformer (MMDiT-X) that offers enhanced image quality, typography, prompt understanding, and resource efficiency.
Architecture
The model is a Multimodal Diffusion Transformer (MMDiT-X) that employs self-attention modules in its first 13 layers, QK normalization for training stability, and mixed-resolution training at resolutions from 256 up to 1440. It uses three fixed, pretrained text encoders, two CLIP models (OpenCLIP-ViT/G and CLIP-ViT/L) and a T5 encoder, to handle text prompts effectively.
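These components are visible once the model is loaded through diffusers. A minimal sketch for inspecting them (the attribute names follow the StableDiffusion3Pipeline API in diffusers; the comments describe each component's role):

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Load the pipeline and inspect the components it bundles.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
)
print(type(pipe.transformer).__name__)     # the MMDiT-X transformer
print(type(pipe.text_encoder).__name__)    # first CLIP text encoder
print(type(pipe.text_encoder_2).__name__)  # second CLIP text encoder
print(type(pipe.text_encoder_3).__name__)  # T5 text encoder
```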
Training
Stable Diffusion 3.5 Medium is trained on diverse data, including synthetic and publicly available datasets. The training strategy involves progressive resolution increases and mixed-scale image training, enhancing its multi-resolution performance and robustness.
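A practical consequence of this training strategy is that the same pipeline can be sampled at multiple resolutions. A hedged sketch (the prompt and the chosen resolutions are illustrative, not values from the model card):

```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
).to("cuda")

# Sample the same prompt at two resolutions within the trained range.
for height, width in [(512, 512), (1024, 1024)]:
    image = pipe(
        "A lighthouse on a rocky coast at dusk",
        height=height,
        width=width,
        num_inference_steps=40,
        guidance_scale=4.5,
    ).images[0]
    image.save(f"lighthouse_{height}x{width}.png")
```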
Guide: Running Locally
- Install Dependencies:
  - Upgrade to the latest version of the diffusers library:

    ```bash
    pip install -U diffusers
    ```

  - For quantization, install bitsandbytes:

    ```bash
    pip install bitsandbytes
    ```
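  - Optionally, verify the environment before moving on. A minimal sanity check (no specific minimum versions are implied by the model card; this simply confirms the libraries import and a GPU is visible):

    ```python
    import diffusers
    import torch

    # Confirm the libraries import and report their versions.
    print("diffusers:", diffusers.__version__)
    print("torch:", torch.__version__)
    # Generation on GPU requires a CUDA device.
    print("CUDA available:", torch.cuda.is_available())
    ```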
- Load and Run Model:
  - Use the pre-trained model from Stability AI with the following Python script:

    ```python
    import torch
    from diffusers import StableDiffusion3Pipeline

    # Load the pipeline in bfloat16 and move it to the GPU.
    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
    )
    pipe = pipe.to("cuda")

    image = pipe(
        "A capybara holding a sign that reads Hello World",
        num_inference_steps=40,
        guidance_scale=4.5,
    ).images[0]
    image.save("capybara.png")
    ```
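  - For reproducible outputs, pass a seeded generator to the pipeline call. A brief sketch (the seed value is an arbitrary illustration; `pipe` is the object loaded in the script above):

    ```python
    # Reuses the `pipe` object from the previous script.
    generator = torch.Generator(device="cuda").manual_seed(42)
    image = pipe(
        "A capybara holding a sign that reads Hello World",
        num_inference_steps=40,
        guidance_scale=4.5,
        generator=generator,  # fixed seed, so reruns produce the same image
    ).images[0]
    image.save("capybara_seed42.png")
    ```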
  - For quantized model execution to reduce VRAM usage:

    ```python
    import torch
    from diffusers import BitsAndBytesConfig, SD3Transformer2DModel, StableDiffusion3Pipeline

    model_id = "stabilityai/stable-diffusion-3.5-medium"

    # Quantize the transformer to 4-bit NF4, computing in bfloat16.
    nf4_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model_nf4 = SD3Transformer2DModel.from_pretrained(
        model_id,
        subfolder="transformer",
        quantization_config=nf4_config,
        torch_dtype=torch.bfloat16,
    )

    pipeline = StableDiffusion3Pipeline.from_pretrained(
        model_id, transformer=model_nf4, torch_dtype=torch.bfloat16
    )
    # Offload idle submodules to the CPU to cut peak VRAM further.
    pipeline.enable_model_cpu_offload()

    prompt = "A whimsical image of a waffle-hippopotamus hybrid."
    image = pipeline(
        prompt=prompt,
        num_inference_steps=40,
        guidance_scale=4.5,
        max_sequence_length=512,
    ).images[0]
    image.save("whimsical.png")
    ```
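  - To see what the quantized setup actually costs in memory, check peak VRAM around a generation. A minimal sketch (continues from the quantized script above and reuses its `pipeline` object):

    ```python
    # Reset counters, run one generation, then report peak usage.
    torch.cuda.reset_peak_memory_stats()
    image = pipeline(
        prompt="A whimsical image of a waffle-hippopotamus hybrid.",
        num_inference_steps=40,
        guidance_scale=4.5,
        max_sequence_length=512,
    ).images[0]
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak VRAM during generation: {peak_gib:.2f} GiB")
    ```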
- Cloud GPUs:
  - Consider using cloud GPU services like AWS, Google Cloud, or Azure for optimal performance.
License
The model is available under the Stability Community License, which permits research and non-commercial use, as well as commercial use by entities with less than $1M in annual revenue. Commercial use above this threshold requires an Enterprise License. More details are available in the Community License Agreement.