Sana_1600M_2Kpx_BF16
Efficient-Large-Model
Introduction
Sana is a text-to-image framework designed for efficient image generation at resolutions up to 4096 × 4096. It produces high-resolution, high-quality images with strong text-image alignment and can run on a laptop GPU. The source code is available on GitHub.
Architecture
Sana employs a Linear Diffusion Transformer with 1648M parameters, optimized for 2Kpx image generation. It pairs a pretrained text encoder (Gemma2-2B-IT) with a spatially compressed latent feature encoder (DC-AE). The model is fine-tuned from a base Sana model and supports mixed prompts in emoji, Chinese, and English.
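To make the "spatially compressed latent" point concrete, here is a back-of-envelope sketch of latent sizes. The 32× spatial compression factor and 32 latent channels are assumptions taken from the published DC-AE design, not stated in this card:

```python
# Rough latent-shape arithmetic for a deep-compression autoencoder (DC-AE).
# ASSUMPTION: 32x spatial downsampling and 32 latent channels, per the DC-AE
# paper; this card does not state the exact configuration.
def latent_shape(height: int, width: int, factor: int = 32, channels: int = 32):
    """Return (channels, latent_h, latent_w) for an input image size."""
    return (channels, height // factor, width // factor)

# A 4096x4096 image maps to a compact 128x128 latent grid, which is what
# lets the diffusion transformer stay affordable at 4K resolution.
print(latent_shape(4096, 4096))  # -> (32, 128, 128)
print(latent_shape(2048, 2048))  # -> (32, 64, 64)
```

The small latent grid is the key design choice: attention cost in the transformer scales with the number of latent tokens, so aggressive spatial compression is what makes 2K–4K generation tractable.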
Training
Training details are not specified in this card, but the model can be fine-tuned for additional capabilities. Known limitations include complex scene generation and rendering realistic human features such as hands.
Guide: Running Locally
- Setup Environment: Ensure you have PyTorch and required libraries installed.
- Clone Repository: Obtain the code from GitHub.
- Prepare Model: Load the model using the provided `.pth` file.
- Run Inference: Use the `SanaPipeline` to generate images from text prompts.
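The steps above can be sketched as a short script. This assumes a `diffusers` build that ships `SanaPipeline`, a CUDA GPU, and network access; the repository id and the generation settings below are illustrative placeholders, not values from this card:

```python
# Minimal inference sketch for Sana via diffusers' SanaPipeline.
# ASSUMPTIONS: the repo id below is hypothetical, and the generation
# settings are illustrative defaults, not taken from this model card.
GEN_DEFAULTS = {
    "height": 2048,            # the model is optimized for 2Kpx output
    "width": 2048,
    "guidance_scale": 5.0,     # illustrative value
    "num_inference_steps": 20, # illustrative value
}

def generate(prompt: str):
    # Imports are deferred so the sketch can be read without the heavy deps.
    import torch
    from diffusers import SanaPipeline

    pipe = SanaPipeline.from_pretrained(
        "Efficient-Large-Model/Sana_1600M_2Kpx_BF16_diffusers",  # hypothetical id
        torch_dtype=torch.bfloat16,  # matches the BF16 checkpoint
    )
    pipe.to("cuda")
    return pipe(prompt=prompt, **GEN_DEFAULTS).images[0]
```

Loading in `bfloat16` halves memory relative to `float32`, which is what makes laptop-GPU inference plausible for a 1648M-parameter model.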
Suggested Cloud GPUs
For optimal performance, consider CUDA-capable cloud GPUs from providers such as AWS, Google Cloud, or Azure.
License
Sana is distributed under the CC BY-NC-SA 4.0 License, which allows for non-commercial use with attribution and share-alike conditions.