Sana_1600M_2Kpx_BF16_diffusers

Efficient-Large-Model

Sana 1600M 2Kpx BF16 Diffusers

Introduction

Sana is a text-to-image framework that can generate high-quality images at resolutions up to 4096 × 4096 with strong text-image alignment. It is designed for efficient deployment and fast generation, and it can run on laptop GPUs. Prompts can be written in English or Chinese.

Architecture

Sana uses a Linear-Diffusion-Transformer-based model with 1648 million parameters. This checkpoint is built to generate images at a 2Kpx base resolution, with multi-scale heights and widths. The model relies on a fixed, pretrained text encoder (Gemma2-2B-IT) and a latent feature encoder with 32x spatial compression (DC-AE).
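
To make the 32x spatial compression concrete, the short sketch below works out the latent grid size for a few image sizes. It is plain arithmetic, not the model's own code: the helper name latent_grid and the sample sizes are illustrative, and it assumes the 32x factor applies to height and width independently.

```python
def latent_grid(height: int, width: int, factor: int = 32) -> tuple[int, int]:
    """Latent height and width for an image under `factor`x spatial compression."""
    # Assumption: sizes that do not divide evenly by the factor are rejected.
    if height % factor or width % factor:
        raise ValueError(f"{height}x{width} is not divisible by {factor}")
    return height // factor, width // factor


# Sample sizes chosen for illustration only.
for h, w in [(2048, 2048), (4096, 4096), (1536, 2560)]:
    lh, lw = latent_grid(h, w)
    print(f"{h}x{w} image -> {lh}x{lw} latent grid ({lh * lw} positions)")
```

Under this assumption, a 2048 × 2048 image maps to a 64 × 64 latent grid, so the diffusion model works on 4096 spatial positions rather than on roughly 4.2 million pixels.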

Training

The model was fine-tuned from Efficient-Large-Model/Sana_1600M_1024px_BF16 and supports mixed prompts, including ones that contain emojis. Although it generates high-quality images, it struggles with complex scenes and with rendering human hands.

Guide: Running Locally

  1. Installation:
    • Run pip install git+https://github.com/huggingface/diffusers to install diffusers from source.
  2. Setup:
    • Import required libraries and load the model using SanaPipeline or SanaPAGPipeline from the diffusers library.
  3. Configuration:
    • Set torch_dtype to torch.bfloat16 when loading the model, then move the pipeline to a GPU by calling pipe.to("cuda").
  4. Execution:
    • Define a prompt and generate an image using the pipe object. Save the resulting image locally.
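
The steps above can be sketched roughly as follows. This is a minimal example under stated assumptions, not the authors' reference script: the repository id is inferred from the organization and model name on this page, the helper names build_call_args and generate_image are hypothetical, and the step count, guidance scale, seed, and output path are illustrative values. It uses SanaPipeline only and needs a CUDA GPU to actually generate an image.

```python
# Assumed Hugging Face repo id, inferred from the organization and model name above.
MODEL_ID = "Efficient-Large-Model/Sana_1600M_2Kpx_BF16_diffusers"


def build_call_args(prompt, height=2048, width=2048, steps=20, guidance_scale=4.5):
    """Collect keyword arguments for the pipeline call (illustrative values)."""
    return {
        "prompt": prompt,
        "height": height,
        "width": width,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
    }


def generate_image(prompt, out_path="sana.png", seed=42):
    """Load the pipeline in bfloat16 on a CUDA GPU, generate one image, save it."""
    # Imported here so build_call_args can be used without torch or a GPU.
    import torch
    from diffusers import SanaPipeline

    # Steps 2 and 3: load with torch_dtype=torch.bfloat16, then move to the GPU.
    pipe = SanaPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    pipe.to("cuda")

    # Step 4: run the prompt with a fixed seed and save the first image.
    generator = torch.Generator(device="cuda").manual_seed(seed)
    result = pipe(generator=generator, **build_call_args(prompt))
    result.images[0].save(out_path)
    return out_path
```

On a machine with a CUDA GPU and the model weights available, a call such as generate_image("A cute panda eating bamboo, ink drawing style") should write the image to sana.png. Keeping the call arguments in a separate helper makes it easy to try other sizes or step counts without touching the loading code.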

For best performance, consider running the model on a cloud GPU instance from a provider such as AWS, Google Cloud, or Azure.

License

The model is released under the CC BY-NC-SA 4.0 License, which allows non-commercial adaptation and distribution with appropriate credit, provided that derivatives are shared under the same license.
