Stable Diffusion 3 Medium

Introduction
Stable Diffusion 3 Medium is an advanced Multimodal Diffusion Transformer (MMDiT) text-to-image model developed by Stability AI. It features enhanced image quality, typography, complex prompt comprehension, and efficient resource use.
Architecture
The model utilizes an MMDiT architecture and incorporates three fixed, pretrained text encoders: OpenCLIP-ViT/G, CLIP-ViT/L, and T5-xxl. Conditioning generation on all three encoders is what gives the model its improved prompt comprehension and typography.
Training
Stable Diffusion 3 Medium was trained using synthetic data and filtered publicly available data, totaling 1 billion images. Fine-tuning involved 30 million high-quality aesthetic images and 3 million preference data images, focusing on specific visual styles and content.
Guide: Running Locally
- Prepare Environment: Install the necessary dependencies and clone the model repository.
- Download Model Files: Retrieve the required .safetensors files, choosing from three packaging variants based on resource needs.
- Set Up Inference Framework: Use ComfyUI or a similar tool for local inference.
- Run the Model: Implement workflows using the example JSON files provided, such as sd3_medium_example_workflow_basic.json.
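The variant choice in step two can be expressed as a small helper. The file names below follow the packaging variants published in the SD3 Medium repository (a minimal package without text encoders, one including the CLIP encoders, and one including the CLIPs plus an fp8 T5-xxl), but the VRAM thresholds are illustrative assumptions, not official guidance:

```python
def pick_variant(vram_gb: float, need_t5: bool = True) -> str:
    """Pick a .safetensors packaging variant by available VRAM.

    Thresholds are illustrative assumptions; check current
    hardware recommendations before relying on them.
    """
    if vram_gb < 8:
        # Smallest package: MMDiT and VAE only; text encoders
        # must be downloaded and loaded separately.
        return "sd3_medium.safetensors"
    if not need_t5 or vram_gb < 12:
        # Includes both CLIP text encoders but omits the large
        # T5-xxl encoder (slightly reduced prompt adherence).
        return "sd3_medium_incl_clips.safetensors"
    # Full package: both CLIPs plus an fp8-quantized T5-xxl.
    return "sd3_medium_incl_clips_t5xxlfp8.safetensors"


print(pick_variant(6))   # low-VRAM GPU -> minimal package
print(pick_variant(16))  # ample VRAM  -> full package
```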
For optimal performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
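The guide above targets ComfyUI; as an alternative, a minimal inference sketch using Hugging Face's diffusers library is shown below. The model id stabilityai/stable-diffusion-3-medium-diffusers and the sampler defaults (28 steps, CFG 7.0) are assumptions to verify against current documentation; the first run downloads several gigabytes of weights and requires a CUDA GPU, so the heavy dependencies are imported lazily.

```python
# Assumed diffusers-format repository for SD3 Medium (requires
# accepting the model license on the Hugging Face Hub first).
MODEL_ID = "stabilityai/stable-diffusion-3-medium-diffusers"


def generation_kwargs(prompt: str, steps: int = 28, cfg: float = 7.0) -> dict:
    """Collect sampler settings; the 28-step / CFG 7.0 defaults are
    commonly used starting points, not official recommendations."""
    if steps < 1:
        raise ValueError("steps must be >= 1")
    return {"prompt": prompt, "num_inference_steps": steps, "guidance_scale": cfg}


def main() -> None:
    # Heavy optional dependencies are imported here so the helper
    # above stays usable without torch/diffusers installed.
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")  # downloads weights on first run
    image = pipe(**generation_kwargs("a photo of a red fox in snow")).images[0]
    image.save("sd3_output.png")


if __name__ == "__main__":
    main()
```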
License
Stable Diffusion 3 Medium is available under the Stability AI Non-Commercial Research Community License, which permits free use for non-commercial purposes such as academic research. Commercial use requires a separate license from Stability AI; see the Stability AI license page for details.