Stable Diffusion 3 Medium

stabilityai

Introduction

Stable Diffusion 3 Medium is a Multimodal Diffusion Transformer (MMDiT) text-to-image model developed by Stability AI. It offers improved performance in image quality, typography, complex prompt understanding, and resource efficiency. The model leverages three fixed, pretrained text encoders: OpenCLIP-ViT/G, CLIP-ViT/L, and T5-xxl.

Architecture

The model is built on the Multimodal Diffusion Transformer (MMDiT) architecture. Each input prompt is encoded by three fixed, pretrained text encoders (OpenCLIP-ViT/G, CLIP-ViT/L, and T5-xxl), and the resulting text representations condition image generation.
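
As a rough illustration of how three encoder outputs can be merged into one conditioning sequence, the sketch below does only the shape arithmetic. The layout (CLIP outputs joined feature-wise, zero-padded to the T5 width, then joined with the T5 output along the token axis) and the sizes used (77 CLIP tokens, 256 T5 tokens, hidden widths 768, 1280, and 4096) are assumptions based on common configurations, not details stated on this page. The names EncoderOutput and combined_context_shape are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderOutput:
    """Shape of one text encoder's per-token output (illustrative only)."""
    name: str
    tokens: int  # sequence length after padding/truncation
    width: int   # hidden size per token


def combined_context_shape(clip_l, clip_g, t5):
    """Return (tokens, width) of the joint text context.

    Assumed layout: the two CLIP outputs are concatenated feature-wise,
    zero-padded up to the T5 width, then concatenated with the T5 output
    along the token axis.
    """
    if clip_l.tokens != clip_g.tokens:
        raise ValueError("CLIP encoders must share a sequence length")
    clip_width = clip_l.width + clip_g.width
    if clip_width > t5.width:
        raise ValueError("CLIP features are wider than T5; cannot zero-pad")
    return (clip_l.tokens + t5.tokens, t5.width)


# Assumed sizes, not taken from this page.
clip_l = EncoderOutput("CLIP-ViT/L", tokens=77, width=768)
clip_g = EncoderOutput("OpenCLIP-ViT/G", tokens=77, width=1280)
t5 = EncoderOutput("T5-xxl", tokens=256, width=4096)

context_shape = combined_context_shape(clip_l, clip_g, t5)
print(context_shape)
```

Under these assumptions the T5 encoder contributes most of the tokens in the context, which is why leaving it out is a common way to save memory.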

Training

The model was pre-trained on 1 billion images drawn from synthetic data and filtered publicly available data. It was then fine-tuned on 30 million high-quality aesthetic images focused on specific visual content and style, plus 3 million preference-data images.

Guide: Running Locally

To run Stable Diffusion 3 Medium locally, follow these steps:

  1. Install Dependencies: Ensure you have Python and the necessary libraries installed. Use pip install -U diffusers to get the latest version of Diffusers.

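  2. Download the Model: Obtain the model weights from the Hugging Face repository and choose the appropriate variant (e.g., sd3_medium.safetensors). With Diffusers, from_pretrained fetches the stabilityai/stable-diffusion-3-medium-diffusers weights automatically on first use.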

  3. Set Up Environment: Use ComfyUI or another compatible interface for inference. You might need to configure CUDA for GPU acceleration.

  4. Run the Model: Use the following Diffusers snippet to generate an image from a text prompt.

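    import torch
    from diffusers import StableDiffusion3Pipeline
    
    # Load the Diffusers-format weights in half precision to reduce GPU memory use.
    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3-medium-diffusers",
        torch_dtype=torch.float16,
    )
    pipe = pipe.to("cuda")  # requires a CUDA-capable GPU
    
    # Generate one image from the text prompt.
    image = pipe(
        "A cat holding a sign that says hello world",
        negative_prompt="",
        num_inference_steps=28,
        guidance_scale=7.0,
    ).images[0]
    
    # Save the resulting PIL image to disk (the file name is arbitrary).
    image.save("sd3_hello_world.png")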
    
  5. Optimize: Refer to the Diffusers documentation for Stable Diffusion 3 for details on memory and speed optimizations and on image-to-image support.
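
As one possible starting point for the optimization step, the sketch below combines two memory-saving options that Diffusers exposes for this pipeline: loading without the T5 encoder (text_encoder_3=None, tokenizer_3=None) and enabling model CPU offload. The helper names pipeline_kwargs and load_pipeline are hypothetical, and the actual memory savings and any quality trade-off are not stated on this page, so treat this as a sketch rather than a recommended configuration.

```python
MODEL_ID = "stabilityai/stable-diffusion-3-medium-diffusers"


def pipeline_kwargs(drop_t5=True):
    """Build extra keyword arguments for StableDiffusion3Pipeline.from_pretrained."""
    kwargs = {}
    if drop_t5:
        # Skipping the T5 encoder and its tokenizer saves memory,
        # possibly at some cost in prompt adherence and typography.
        kwargs["text_encoder_3"] = None
        kwargs["tokenizer_3"] = None
    return kwargs


def load_pipeline(drop_t5=True, cpu_offload=True):
    """Load the pipeline with optional memory-saving settings."""
    # Imported lazily so pipeline_kwargs() works without torch installed.
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16, **pipeline_kwargs(drop_t5)
    )
    if cpu_offload:
        # Moves sub-models to the GPU only while they are needed
        # (requires the accelerate package).
        pipe.enable_model_cpu_offload()
    else:
        pipe = pipe.to("cuda")
    return pipe
```

Calling load_pipeline() returns a pipeline that can be used exactly as in step 4, for example pipe("A cat holding a sign that says hello world").images[0].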

Cloud GPUs

If local hardware lacks a sufficiently capable GPU, the same steps can be run on a rented cloud GPU instance from providers such as AWS, Google Cloud, or Azure.

License

Stable Diffusion 3 Medium is released under the Stability AI Community License. It is free for research and non-commercial use, and for commercial use by organizations or individuals with less than $1M in annual revenue. Entities exceeding this threshold must acquire an Enterprise license. More information is available on the Stability AI License page.
