Stable Diffusion 3.5 Medium

stabilityai

Introduction

Stable Diffusion 3.5 Medium is a text-to-image generation model developed by Stability AI. It is an advanced Multimodal Diffusion Transformer (MMDiT-X) that offers enhanced image quality, typography, prompt understanding, and resource efficiency.

Architecture

The MMDiT-X architecture adds self-attention modules in the first 13 transformer layers, applies QK normalization for training stability, and is trained at mixed resolutions from 256 to 1440 pixels. Text prompts are encoded by three fixed, pretrained text encoders: two CLIP models and T5.
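
To illustrate why QK normalization helps stability, here is a minimal sketch in plain Python. It assumes an RMS-style normalization without learnable gain, which is a simplification chosen for illustration and not a statement about the model's exact implementation; the function names are hypothetical. Normalizing the query and key vectors bounds the attention logit regardless of how large the raw activations grow.

```python
import math

def rms_norm(vec, eps=1e-6):
    # Scale the vector so that its root-mean-square is approximately 1.
    mean_sq = sum(x * x for x in vec) / len(vec)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [x * scale for x in vec]

def attention_logit(q, k, qk_norm=True):
    # Scaled dot-product logit for one query/key pair.
    # With qk_norm=True, both vectors are normalized first (assumed RMS norm),
    # so the result is bounded by sqrt(len(q)) in absolute value.
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

q = [1000.0, -2000.0, 3000.0, 500.0]
bounded = attention_logit(q, q)                  # stays near sqrt(4) = 2
unbounded = attention_logit(q, q, qk_norm=False) # grows with activation scale
```

Without normalization the logit scales with the square of the activation magnitude, which can saturate the softmax during training.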

Training

Stable Diffusion 3.5 Medium was trained on diverse data, including synthetic and publicly available datasets. The training strategy combines progressive resolution increases with mixed-scale image training, which improves multi-resolution performance and robustness.
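
Since the model is trained across resolutions from 256 to 1440, it is reasonable to request sizes other than the default at inference time. The sketch below is a hypothetical helper, snap_resolution, that clamps a requested size to that range and rounds it down to a multiple of 16. The multiple of 16 is an assumption about what the pipeline accepts; check the diffusers documentation for the version you have installed.

```python
def snap_resolution(width, height, multiple=16, lo=256, hi=1440):
    """Clamp each side to [lo, hi] and round it down to a multiple of `multiple`.

    The defaults mirror the 256 to 1440 training range; the multiple of 16
    is an assumption, not a documented guarantee.
    """
    def snap(value):
        value = max(lo, min(hi, int(value)))
        return (value // multiple) * multiple

    return snap(width), snap(height)

width, height = snap_resolution(1000, 300)
```

The resulting values could then be passed to the pipeline call as its width and height arguments.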

Guide: Running Locally

  1. Install Dependencies:

    • Upgrade to the latest version of the diffusers library:
      pip install -U diffusers
      
    • For quantization, install bitsandbytes:
      pip install bitsandbytes
      
  2. Load and Run Model:

    • Use the pre-trained model from Stability AI with the following Python script:
      import torch
      from diffusers import StableDiffusion3Pipeline
      
      # Load the pipeline in bfloat16 and move it to the GPU.
      pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16)
      pipe = pipe.to("cuda")
      
      # Generate one image from the prompt and save it to disk.
      image = pipe(
          "A capybara holding a sign that reads Hello World",
          num_inference_steps=40,
          guidance_scale=4.5,
      ).images[0]
      image.save("capybara.png")
      
    • For quantized model execution to reduce VRAM usage:
      import torch
      from diffusers import BitsAndBytesConfig, SD3Transformer2DModel
      from diffusers import StableDiffusion3Pipeline
      
      model_id = "stabilityai/stable-diffusion-3.5-medium"
      
      # Store the transformer weights as 4-bit NF4; compute in bfloat16.
      nf4_config = BitsAndBytesConfig(
          load_in_4bit=True,
          bnb_4bit_quant_type="nf4",
          bnb_4bit_compute_dtype=torch.bfloat16
      )
      model_nf4 = SD3Transformer2DModel.from_pretrained(
          model_id,
          subfolder="transformer",
          quantization_config=nf4_config,
          torch_dtype=torch.bfloat16
      )
      
      # Build the pipeline around the quantized transformer.
      pipeline = StableDiffusion3Pipeline.from_pretrained(
          model_id,
          transformer=model_nf4,
          torch_dtype=torch.bfloat16
      )
      # Keep components on the CPU and move each to the GPU only while it runs.
      pipeline.enable_model_cpu_offload()
      
      prompt = "A whimsical image of a waffle-hippopotamus hybrid."
      image = pipeline(
          prompt=prompt,
          num_inference_steps=40,
          guidance_scale=4.5,
          max_sequence_length=512,
      ).images[0]
      image.save("whimsical.png")
      
  3. Cloud GPUs:

    • If no local GPU with enough VRAM is available, cloud GPU services such as AWS, Google Cloud, or Azure can run the same scripts.
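
Before running the scripts from step 2, it can help to confirm that the packages from step 1 are importable in the current environment. The following is a minimal sketch using only the standard library; check_packages is a hypothetical helper name, and the package list simply reflects the imports used in this guide.

```python
import importlib.util

def check_packages(names):
    # Map each top-level package name to True if Python can locate it.
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Example: report which of the packages used in this guide are installed.
status = check_packages(["torch", "diffusers", "bitsandbytes"])
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

Any package reported as missing can be installed with pip as described in step 1. This check only confirms that a package can be located, not that its version is recent enough.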

License

The model is released under the Stability AI Community License, which permits research, non-commercial, and commercial use by individuals and organizations with less than $1M in annual revenue. Above that threshold, an Enterprise License is required. More details are available in the Community License Agreement.
