Sana_1600M_1024px
Efficient-Large-Model
Introduction
Sana is an advanced text-to-image generative model developed by NVIDIA, capable of synthesizing high-resolution images with strong text-image alignment. It can produce images up to 4096 × 4096 resolution efficiently and is designed to operate on consumer-grade hardware like a laptop GPU.
Architecture
Sana is built on a Linear-Diffusion-Transformer framework. It uses a fixed, pretrained text encoder (Gemma2-2B-IT) and a latent feature encoder (DC-AE) that compresses images 32x spatially. The model has 1648 million parameters and generates images based on a 1024px resolution, with height and width varying across multiple scales.
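To make the 32x compression figure concrete, the following minimal sketch computes the size of the latent grid for a given image size. The function name `latent_grid` is hypothetical, and the 8x factor used for comparison is an assumption about other latent-diffusion autoencoders, not a figure from this card.

```python
def latent_grid(height: int, width: int, compression: int = 32) -> tuple[int, int]:
    """Return the (rows, cols) of the latent grid for an image of the given size.

    `compression` is the spatial downsampling factor of the autoencoder
    (32 for DC-AE, according to the text above).
    """
    if height % compression or width % compression:
        raise ValueError(f"height and width must be multiples of {compression}")
    return height // compression, width // compression


for size in (1024, 4096):
    rows, cols = latent_grid(size, size)        # DC-AE, 32x
    rows8, cols8 = latent_grid(size, size, 8)   # assumed 8x baseline, for comparison
    print(f"{size}px: 32x -> {rows}x{cols} latent, 8x -> {rows8}x{cols8} latent")
```

Under these assumptions, a 1024 × 1024 image maps to a 32 × 32 latent grid, which has 16 times fewer spatial positions than the 128 × 128 grid an 8x autoencoder would produce.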
Training
Sana's development integrated advanced diffusion samplers such as Flow-DPM-Solver, which are available through the generative-models GitHub repository. Training focuses on high-quality, high-resolution image generation with strong text-image alignment.
Guide: Running Locally
- Clone Repository: Start by cloning the Sana repository from GitHub:
  git clone https://github.com/NVlabs/Sana
- Setup Environment: Install the necessary dependencies as specified in requirements.txt.
- Download Model Weights: Obtain the model weights from the Hugging Face model hub or the GitHub repository.
- Run Inference: Use a Python script to load the model and generate images based on text prompts.
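The inference step might look like the sketch below. It assumes the Hugging Face `diffusers` library (a version that ships `SanaPipeline`), PyTorch with a CUDA GPU, and the diffusers-format checkpoint ID `Efficient-Large-Model/Sana_1600M_1024px_diffusers`. None of these are stated in the steps above, so check them against the repository's own instructions. The helper `build_call_kwargs`, its default values, the precision choices, and the output file name are hypothetical.

```python
def build_call_kwargs(prompt: str, height: int = 1024, width: int = 1024,
                      steps: int = 20, guidance_scale: float = 5.0) -> dict:
    """Collect and validate generation arguments (hypothetical helper)."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    # Assumption: sizes should be multiples of 32 to match the 32x autoencoder.
    if height % 32 or width % 32:
        raise ValueError("height and width must be multiples of 32")
    return {
        "prompt": prompt,
        "height": height,
        "width": width,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
    }


def generate(prompt: str, out_path: str = "sana.png", seed: int = 42) -> str:
    """Load the pipeline, generate one image, and save it to out_path."""
    # Heavy imports stay local so the helper above works without a GPU stack.
    import torch
    from diffusers import SanaPipeline

    pipe = SanaPipeline.from_pretrained(
        "Efficient-Large-Model/Sana_1600M_1024px_diffusers",  # assumed checkpoint ID
        torch_dtype=torch.float16,
    )
    pipe.to("cuda")
    # Assumed precision choice: keep the autoencoder and text encoder in bfloat16.
    pipe.vae.to(torch.bfloat16)
    pipe.text_encoder.to(torch.bfloat16)

    generator = torch.Generator(device="cuda").manual_seed(seed)
    result = pipe(**build_call_kwargs(prompt), generator=generator)
    result.images[0].save(out_path)
    return out_path
```

If the assumptions hold, calling `generate("a cyberpunk cat with a neon sign")` would write `sana.png` to the working directory.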
Suggestion: For optimal performance, consider using cloud GPU services, such as those offered by AWS, Google Cloud, or Azure.
License
Sana is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). This license permits sharing and adaptation for non-commercial purposes only, provided that credit is given and that derivative works are distributed under the same license.