Introduction

SD-Turbo is a generative text-to-image model that enables the synthesis of photorealistic images from text prompts in a single network evaluation. It is a distilled version of Stable Diffusion 2.1, designed for real-time synthesis using Adversarial Diffusion Distillation (ADD). This model is released as a research artifact to study small, distilled text-to-image models.

Architecture

SD-Turbo is trained with Adversarial Diffusion Distillation (ADD), a method that combines score distillation with an adversarial loss. The score distillation term uses a large-scale, off-the-shelf image diffusion model as the teacher signal, and the adversarial loss keeps image fidelity high even in low-step regimes.
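As a rough illustration of how the two terms combine, the toy sketch below adds an adversarial term to a weighted distillation term. It is not the SD-Turbo training code: the hinge-style generator loss, the squared-error distance, the weight `lam`, and all function names are assumptions chosen for illustration, and plain lists of floats stand in for image tensors.

```python
from typing import Sequence


def adversarial_generator_loss(disc_logits: Sequence[float]) -> float:
    """Generator side of a hinge-style GAN loss (an assumed choice):
    the student is rewarded when the discriminator scores its samples highly."""
    return -sum(disc_logits) / len(disc_logits)


def distillation_loss(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean squared distance between the student's output and the
    teacher's denoised estimate (squared error is an assumed metric)."""
    squared = [(s - t) ** 2 for s, t in zip(student, teacher)]
    return sum(squared) / len(squared)


def combined_student_loss(
    disc_logits: Sequence[float],
    student: Sequence[float],
    teacher: Sequence[float],
    lam: float = 1.0,  # hypothetical weighting between the two terms
) -> float:
    """Adversarial term plus lam times the distillation term."""
    return adversarial_generator_loss(disc_logits) + lam * distillation_loss(student, teacher)


# Toy values only: two discriminator scores and two-element "images".
loss = combined_student_loss([1.0, 3.0], [0.0, 2.0], [0.0, 0.0], lam=0.5)
print(loss)
```

In a real setup the teacher estimate would come from a frozen diffusion model and the discriminator would be trained alongside the student; both are outside the scope of this sketch.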

Training

SD-Turbo is trained with ADD, which makes it possible to sample from large-scale foundational image diffusion models in 1 to 4 steps. The model produces high-quality images even at a single step, and in human preference evaluations its single-step outputs were preferred over those of the other models compared, for both image quality and prompt alignment.

Guide: Running Locally

  1. Install Required Packages:

    pip install diffusers transformers accelerate --upgrade
    
  2. Text-to-Image Generation:

    from diffusers import AutoPipelineForText2Image
    import torch
    
    # Load the half-precision (fp16) weights and move the pipeline to the GPU.
    pipe = AutoPipelineForText2Image.from_pretrained("stabilityai/sd-turbo", torch_dtype=torch.float16, variant="fp16")
    pipe.to("cuda")
    
    prompt = "A cinematic shot of a baby raccoon wearing an intricate Italian priest robe."
    # One sampling step; guidance_scale=0.0 turns off classifier-free guidance.
    image = pipe(prompt=prompt, num_inference_steps=1, guidance_scale=0.0).images[0]
    
  3. Image-to-Image Generation:

    from diffusers import AutoPipelineForImage2Image
    from diffusers.utils import load_image
    import torch
    
    # Load the half-precision (fp16) weights and move the pipeline to the GPU.
    pipe = AutoPipelineForImage2Image.from_pretrained("stabilityai/sd-turbo", torch_dtype=torch.float16, variant="fp16")
    pipe.to("cuda")
    
    # Download the starting image and resize it to 512x512.
    init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png").resize((512, 512))
    prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
    
    # Keep num_inference_steps * strength >= 1 so at least one step runs (here 2 * 0.5 = 1).
    image = pipe(prompt, image=init_image, num_inference_steps=2, strength=0.5, guidance_scale=0.0).images[0]
    
  4. Recommendation: The examples above assume a CUDA-capable GPU. If you do not have one locally, consider a cloud GPU instance from a provider such as AWS, GCP, or Azure.
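For the image-to-image call in step 3, the choice of num_inference_steps and strength matters because only part of the schedule is run. The helper below is a minimal sketch that assumes the pipeline runs int(num_inference_steps * strength) denoising steps, which is how the diffusers image-to-image pipelines are understood to behave. The function name is hypothetical and is not part of diffusers.

```python
def effective_img2img_steps(num_inference_steps: int, strength: float) -> int:
    """Estimate how many denoising steps an image-to-image call would run.

    Assumes the pipeline truncates the schedule to
    int(num_inference_steps * strength) steps.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps = min(int(num_inference_steps * strength), num_inference_steps)
    if steps < 1:
        raise ValueError(
            "num_inference_steps * strength must be at least 1, "
            f"got {num_inference_steps} * {strength}"
        )
    return steps


# The settings from step 3: two scheduled steps at strength 0.5.
print(effective_img2img_steps(2, 0.5))
```

Checking the values before calling the pipeline avoids a run in which no denoising step would be executed, for example num_inference_steps=1 with strength=0.5.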

License

For commercial use, please refer to the Stability AI license at https://stability.ai/license. Excluded uses and further guidelines are described in Stability AI's Acceptable Use Policy.
