Stable Diffusion 3.5 Large
Introduction
Stable Diffusion 3.5 Large, developed by Stability AI, is a Multimodal Diffusion Transformer (MMDiT) text-to-image model. It offers enhanced image quality, typography, complex prompt understanding, and resource efficiency.
Architecture
The model uses three fixed, pretrained text encoders (OpenCLIP and T5-XXL) and applies QK-normalization to improve training stability, with text context lengths of up to 256 tokens at different training stages.
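The mechanics of QK-normalization are not spelled out above; as a hedged illustration of the idea, the minimal sketch below RMS-normalizes the query and key vectors before the attention score computation, which bounds the magnitude of the logits. All names and shapes here are illustrative, not from the SD 3.5 implementation:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Scale each vector to unit RMS; learnable gains are omitted for brevity.
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def qk_normalized_attention(q, k, v):
    # Normalizing Q and K before the dot product keeps attention logits
    # bounded, which is the training-stability benefit QK-normalization aims at.
    q, k = rms_norm(q), rms_norm(k)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
out = qk_normalized_attention(q, k, v)
print(out.shape)  # (4, 8)
```

Because each normalized query/key row has unit RMS, every attention logit lies in a fixed range regardless of how large the raw activations grow during training.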
Training
Stable Diffusion 3.5 was trained on diverse datasets, including synthetic data and filtered publicly available data.
Model Stats
- Model Type: MMDiT text-to-image generative model
- Text Encoders: OpenCLIP, T5-xxl
- Training Data: Synthetic and publicly available data
Guide: Running Locally
- Environment Setup:
  Install the latest version of the Hugging Face diffusers library:
  pip install -U diffusers
- Load and Run the Model:
  import torch
  from diffusers import StableDiffusion3Pipeline

  pipe = StableDiffusion3Pipeline.from_pretrained(
      "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
  )
  pipe = pipe.to("cuda")
  image = pipe(
      "A capybara holding a sign that reads Hello World",
      num_inference_steps=28,
      guidance_scale=3.5,
  ).images[0]
  image.save("capybara.png")
- Quantization (Optional):
  Install bitsandbytes for quantization:
  pip install bitsandbytes
  Use the quantized model to reduce VRAM usage.
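One possible way to run the quantized model, assuming a recent diffusers build with bitsandbytes support (the `BitsAndBytesConfig` and NF4 settings below are an assumption about your installed versions, so treat this as a sketch rather than the canonical recipe):

```python
import torch
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel, StableDiffusion3Pipeline

model_id = "stabilityai/stable-diffusion-3.5-large"

# Load the transformer in 4-bit NF4 to substantially reduce VRAM usage.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = SD3Transformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
pipe = StableDiffusion3Pipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.bfloat16
)
# Offload idle submodules to CPU to fit on smaller GPUs.
pipe.enable_model_cpu_offload()

image = pipe(
    "A capybara holding a sign that reads Hello World",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("capybara.png")
```

Quantizing only the transformer keeps the text encoders and VAE at full precision, trading a small quality risk for the largest share of the memory savings.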
Cloud GPUs: Consider using cloud services like AWS, GCP, or Azure for access to powerful GPUs.
License
The model is released under the Stability Community License, allowing free use for research, non-commercial, and commercial activities for entities with less than $1M in annual revenue. For those exceeding this revenue threshold, an Enterprise License is required. More details are available in the Community License Agreement.