Stable Diffusion 3.5 Large
Introduction
Stable Diffusion 3.5 Large, developed by Stability AI, is a Multimodal Diffusion Transformer (MMDiT) text-to-image model. It offers enhanced image quality, typography, complex prompt understanding, and resource efficiency.
Architecture
The model uses three fixed, pretrained text encoders (OpenCLIP and T5-XXL) and applies QK-normalization to improve training stability, with text context lengths of up to 256 tokens at different training stages.
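The mechanics of QK-normalization are not spelled out above; as a hedged illustration of the idea, the minimal sketch below RMS-normalizes the query and key vectors before the attention score computation, which bounds the magnitude of the logits. All names and shapes here are illustrative, not from the SD 3.5 implementation:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Scale each vector to unit RMS; learnable gains are omitted for brevity.
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def qk_normalized_attention(q, k, v):
    # Normalizing Q and K before the dot product keeps attention logits
    # bounded, which is the training-stability benefit QK-normalization aims at.
    q, k = rms_norm(q), rms_norm(k)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
out = qk_normalized_attention(q, k, v)
print(out.shape)  # (4, 8)
```

Because each normalized query/key row has unit RMS, every attention logit lies in a fixed range regardless of how large the raw activations grow during training.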
Training
Stable Diffusion 3.5 was trained on diverse datasets, including synthetic data and filtered publicly available data.
Model Stats
- Model Type: MMDiT text-to-image generative model
- Text Encoders: OpenCLIP, T5-xxl
- Training Data: Synthetic and publicly available data
Guide: Running Locally
- Environment Setup:
  Install the latest version of the Hugging Face diffusers library:
  pip install -U diffusers
- Load and Run the Model:
  import torch
  from diffusers import StableDiffusion3Pipeline

  pipe = StableDiffusion3Pipeline.from_pretrained(
      "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
  )
  pipe = pipe.to("cuda")
  image = pipe(
      "A capybara holding a sign that reads Hello World",
      num_inference_steps=28,
      guidance_scale=3.5,
  ).images[0]
  image.save("capybara.png")
- Quantization (Optional):
  Install bitsandbytes for quantization:
  pip install bitsandbytes
  Use the quantized model to reduce VRAM usage.
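One possible way to run the quantized model, assuming a recent diffusers build with bitsandbytes support (the `BitsAndBytesConfig` and NF4 settings below are an assumption about your installed versions, so treat this as a sketch rather than the canonical recipe):

```python
import torch
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel, StableDiffusion3Pipeline

model_id = "stabilityai/stable-diffusion-3.5-large"

# Load the transformer in 4-bit NF4 to substantially reduce VRAM usage.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = SD3Transformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
pipe = StableDiffusion3Pipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.bfloat16
)
# Offload idle submodules to CPU to fit on smaller GPUs.
pipe.enable_model_cpu_offload()

image = pipe(
    "A capybara holding a sign that reads Hello World",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("capybara.png")
```

Quantizing only the transformer keeps the text encoders and VAE at full precision, trading a small quality risk for the largest share of the memory savings.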
Cloud GPUs: Consider using cloud services like AWS, GCP, or Azure for access to powerful GPUs.
License
The model is released under the Stability Community License, allowing free use for research, non-commercial, and commercial activities for entities with less than $1M in annual revenue. For those exceeding this revenue threshold, an Enterprise License is required. More details are available in the Community License Agreement.