Stable Diffusion 3 Medium

Introduction

Stable Diffusion 3 Medium is an advanced Multimodal Diffusion Transformer (MMDiT) text-to-image model developed by Stability AI. It features enhanced image quality, typography, complex prompt comprehension, and efficient resource use.

Architecture

The model utilizes an MMDiT architecture and incorporates three fixed, pretrained text encoders: OpenCLIP-ViT/G, CLIP-ViT/L, and T5-XXL. Together, these encoders let the model condition image generation on text prompts with improved fidelity.
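As a rough illustration of how the three encoders' outputs are combined for conditioning, the sketch below uses the embedding widths reported for these encoders (CLIP-ViT/L: 768, OpenCLIP-ViT/G: 1280, T5-XXL: 4096): the two pooled CLIP vectors are concatenated into one conditioning vector, while per-token CLIP features are concatenated channel-wise, zero-padded to T5's width, and joined with the T5 token sequence. This is a conceptual shape sketch, not the model's actual code; the token count of 77 is illustrative.

```python
import numpy as np

# Illustrative tensor shapes only (zeros stand in for real encoder outputs).
seq_clip, seq_t5 = 77, 77  # token counts, illustrative

clip_l_tokens = np.zeros((seq_clip, 768))    # CLIP-ViT/L per-token features
clip_g_tokens = np.zeros((seq_clip, 1280))   # OpenCLIP-ViT/G per-token features
t5_tokens     = np.zeros((seq_t5, 4096))     # T5-XXL per-token features
clip_l_pooled = np.zeros(768)
clip_g_pooled = np.zeros(1280)

# Pooled CLIP vectors are concatenated into a single conditioning vector.
pooled = np.concatenate([clip_l_pooled, clip_g_pooled])                # (2048,)

# Per-token CLIP features: concatenate channel-wise, zero-pad to T5's
# width, then join with the T5 sequence along the token axis.
clip_tokens = np.concatenate([clip_l_tokens, clip_g_tokens], axis=-1)  # (77, 2048)
clip_tokens = np.pad(clip_tokens, ((0, 0), (0, 4096 - 2048)))          # (77, 4096)
context = np.concatenate([clip_tokens, t5_tokens], axis=0)             # (154, 4096)
```

This also shows why T5-XXL can be dropped at inference time to save memory: removing its rows from `context` shortens the sequence without changing the feature width.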

Training

Stable Diffusion 3 Medium was trained on roughly 1 billion images drawn from synthetic data and filtered publicly available data. It was then fine-tuned on 30 million high-quality aesthetic images and 3 million preference-data images to steer it toward specific visual styles and content.

Guide: Running Locally

  1. Prepare Environment: Install necessary dependencies and clone the model repository.
  2. Download Model Files: Retrieve the necessary .safetensors files, choosing from three packaging variants based on resource needs.
  3. Set Up Inference Framework: Use ComfyUI or similar tools for local inference.
  4. Run the Model: Build workflows from the provided example JSON files, such as sd3_medium_example_workflow_basic.json.
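The workflow files in step 4 follow ComfyUI's JSON graph format, which can be inspected or generated programmatically. The sketch below shows only the general shape of a ComfyUI API-format workflow (a mapping from node id to `class_type` and `inputs`); the node names and parameter values here are illustrative placeholders, so consult the shipped sd3_medium_example_workflow_basic.json for the real graph.

```python
import json

# Minimal ComfyUI-style workflow skeleton. Node names and values are
# illustrative, not taken from the shipped example file.
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "sd3_medium.safetensors"}},
    "2": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["1", 1], "text": "a scenic mountain lake"}},
    "3": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0], "positive": ["2", 0],
                     "seed": 0, "steps": 28, "cfg": 4.5}},
}

# Serialize, reload, and sanity-check that every node declares a class_type.
blob = json.dumps(workflow, indent=2)
loaded = json.loads(blob)
assert all("class_type" in node for node in loaded.values())
```

Edge references like `["1", 1]` name a source node id and its output slot, which is how the graph wires the checkpoint loader's outputs into downstream nodes.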

For optimal performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.

License

Stable Diffusion 3 Medium is released under the Stability AI Non-Commercial Research Community License, which permits free use for non-commercial purposes such as academic research. Commercial use requires a separate license from Stability AI; see the Stability AI license page for details.