Sana_1600M_2Kpx_BF16_diffusers
Efficient-Large-Model / Sana 1600M 2Kpx BF16 Diffusers
Introduction
Sana is a text-to-image framework that generates high-quality images at resolutions up to 4096 × 4096 with strong text-image alignment. It is designed for fast, efficient inference and can be deployed even on laptop GPUs. It accepts prompts in both English and Chinese.
Architecture
Sana uses a Linear Diffusion Transformer (Linear DiT) with 1648 million parameters. This checkpoint generates images at a 2Kpx base resolution and supports multiple height and width combinations. It uses a fixed, pretrained text encoder (Gemma2-2B-IT) and a deep compression autoencoder (DC-AE) that compresses images 32x spatially into latent features.
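To make the effect of the 32x spatial compression concrete, the short sketch below computes the latent grid size for a 2048 × 2048 image. The helper name `latent_grid` is hypothetical and not part of any library, the divisibility check is a simplification for this sketch, and the 8x autoencoder is included only as an illustrative comparison point.

```python
def latent_grid(height: int, width: int, compression: int = 32) -> tuple[int, int]:
    """Return (rows, cols) of the latent grid for an image of the given pixel size.

    `compression` is the spatial downsampling factor of the autoencoder
    (32 for the DC-AE described above).
    """
    if height % compression or width % compression:
        raise ValueError("height and width must be multiples of the compression factor")
    return height // compression, width // compression


rows, cols = latent_grid(2048, 2048)        # 32x autoencoder, as described above
rows8, cols8 = latent_grid(2048, 2048, 8)   # hypothetical 8x autoencoder, for comparison
print(rows, cols, rows * cols)              # 64 64 4096
print((rows8 * cols8) // (rows * cols))     # 16
```

Under these assumptions, a 2Kpx image maps to 4096 latent positions, 16 times fewer than an 8x autoencoder would produce, which gives a rough sense of why high compression helps at large resolutions.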
Training
The model was fine-tuned from Efficient-Large-Model/Sana_1600M_1024px_BF16 and supports mixed prompts, including emojis. It produces high-quality images but struggles with complex scenes and with rendering human hands.
Guide: Running Locally
- Installation:
  - Run `pip install git+https://github.com/huggingface/diffusers` to install the necessary dependencies.
- Setup:
  - Import the required libraries and load the model using `SanaPipeline` or `SanaPAGPipeline` from the `diffusers` library.
- Configuration:
  - Set `torch_dtype` to `torch.bfloat16` and use a GPU by calling `pipe.to("cuda")`.
- Execution:
  - Define a prompt and generate an image using the `pipe` object. Save the resulting image locally.
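The steps above can be sketched roughly as follows. This is a minimal example, not an official recipe: the repository id is assembled from this card's title, the helper names `generation_kwargs` and `generate_image` are hypothetical, and the resolution, step count, guidance scale, seed, and output path are illustrative assumptions rather than tuned settings.

```python
# Repository id assumed from this card's title.
MODEL_ID = "Efficient-Large-Model/Sana_1600M_2Kpx_BF16_diffusers"


def generation_kwargs(prompt, height=2048, width=2048, steps=20, guidance_scale=5.0):
    """Collect pipeline call arguments; the defaults are illustrative, not tuned."""
    return {
        "prompt": prompt,
        "height": height,
        "width": width,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
    }


def generate_image(prompt, out_path="sana.png", seed=42):
    """Load the pipeline in bfloat16 on a CUDA GPU, generate one image, and save it."""
    # Imported lazily so the helper above works without the GPU stack installed.
    import torch
    from diffusers import SanaPipeline

    pipe = SanaPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    pipe.to("cuda")

    # A seeded generator makes the result reproducible for a given setup.
    generator = torch.Generator(device="cuda").manual_seed(seed)
    image = pipe(**generation_kwargs(prompt), generator=generator).images[0]
    image.save(out_path)
    return out_path
```

Calling `generate_image("A cute 🐼 eating 🎋, ink drawing style")` would download the model weights on first use and requires a CUDA-capable GPU with enough memory for 2Kpx generation.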
For optimal performance, cloud GPUs from providers such as AWS, Google Cloud, or Azure are recommended.
License
The model is released under the CC BY-NC-SA 4.0 License, which allows non-commercial adaptation and distribution with appropriate credit, provided that derivatives are shared under the same license.