Sana: Efficient Text-to-Image Model

Introduction

Sana is a text-to-image generation framework that can produce images at resolutions up to 4096 × 4096 pixels. It is designed for fast inference and can be deployed on laptop GPUs.

Architecture

Sana is a generative model built on a Linear Diffusion Transformer with 1648M parameters. It uses the Gemma2-2B-IT text encoder and DC-AE, an autoencoder that compresses images 32x spatially into latent features. Sana supports multilingual prompts, including English and Chinese, and can interpret emojis.
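
To make the 32x spatial compression concrete, the following minimal sketch computes the size of the latent grid for a few image resolutions. The function name `latent_grid` is hypothetical and not part of the Sana codebase; the sketch only does the arithmetic implied by the compression factor and assumes image sides are multiples of that factor.

```python
def latent_grid(height: int, width: int, compression: int = 32) -> tuple[int, int]:
    """Return (rows, cols) of the latent feature map for a given image size.

    Hypothetical helper for illustration; `compression` is the spatial
    downsampling factor of the autoencoder (32 for DC-AE as described above).
    """
    if height % compression or width % compression:
        raise ValueError("image sides must be multiples of the compression factor")
    return height // compression, width // compression


for side in (1024, 2048, 4096):
    rows, cols = latent_grid(side, side)
    print(f"{side}x{side} image -> {rows}x{cols} latent grid ({rows * cols} positions)")
```

At 1024 × 1024 the diffusion model works on a 32 × 32 grid, and even at 4096 × 4096 the grid is only 128 × 128, which is one reason high-resolution generation stays tractable.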

Training

This model is fine-tuned from the base model Efficient-Large-Model/Sana_1600M_1024px to improve its multilingual and mixed-prompt capabilities. It emphasizes efficient image synthesis with strong text-image alignment.

Guide: Running Locally

Clone the Repository:

```
git clone https://github.com/NVlabs/Sana
```

Install Dependencies: Navigate to the repository and install necessary Python packages.
```
cd Sana
pip install -r requirements.txt
```
Run the Model: Execute the model script to generate images from text prompts.
```
python run_sana.py --prompt "Your text prompt here"
```
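To generate images for several prompts in a row, the command above can be wrapped in a small Python helper. This is a sketch under assumptions: it relies only on the `run_sana.py` script and `--prompt` flag shown in this guide, and the names `build_command` and `generate_all` are hypothetical, not part of the repository.

```python
import subprocess
import sys


def build_command(prompt: str, script: str = "run_sana.py") -> list[str]:
    """Build the argument list for one generation run (hypothetical helper)."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    # Passing a list instead of a shell string avoids quoting problems with
    # prompts that contain quotes, emojis, or non-ASCII characters.
    return [sys.executable, script, "--prompt", prompt]


def generate_all(prompts: list[str], repo_dir: str = "Sana") -> None:
    """Run the script once per prompt from inside the cloned repository."""
    for prompt in prompts:
        # check=True stops the batch at the first failing run.
        subprocess.run(build_command(prompt), cwd=repo_dir, check=True)
```

For example, `generate_all(["a lighthouse at dusk", "一只戴着帽子的猫"])` would invoke the script twice, once per prompt, using the same Python interpreter that runs the helper.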
Cloud GPUs: If local hardware is too slow or lacks memory, for example at the highest resolutions, consider renting a GPU instance from a cloud provider such as AWS or Google Cloud.

License

Sana is released under the CC BY-NC-SA 4.0 License, which permits non-commercial use, sharing, and adaptation with attribution, provided that derivatives are distributed under the same license. For more details, refer to the license file.