Introduction

Sana is a text-to-image generative model developed by NVIDIA that synthesizes high-resolution images with strong text-image alignment. It can efficiently generate images at resolutions up to 4096 × 4096 and is designed to be deployable on consumer-grade hardware such as a laptop GPU.

Architecture

Sana is built on a linear Diffusion Transformer (Linear DiT) backbone, paired with a fixed, pretrained text encoder (Gemma2-2B-IT) and a deep-compression autoencoder (DC-AE) that compresses images 32x spatially into latent features. The model comprises roughly 1.6 billion (1648M) parameters and generates images at a 1024px base resolution with multi-scale heights and widths.
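
To make the 32x compression concrete, the short sketch below works out the latent grid the diffusion transformer operates on for a 1024 × 1024 image. The numbers are back-of-the-envelope figures derived from the description above, not values read from the released code.

    # Back-of-the-envelope check of what 32x spatial compression implies.
    # Assumption: a square 1024 x 1024 input and a purely spatial 32x
    # compression factor, as described above.
    image_size = 1024
    compression = 32  # DC-AE spatial compression factor

    latent_size = image_size // compression  # 32
    num_tokens = latent_size * latent_size   # 1024 latent positions

    print(f"latent grid: {latent_size} x {latent_size} -> {num_tokens} tokens "
          "for the linear diffusion transformer")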

Training

Sana integrates advanced diffusion samplers such as Flow-DPM-Solver for fast, few-step sampling; the sampler implementation is available in the Sana GitHub repository. Training targets high-quality, high-resolution image generation with efficient and accurate text-image alignment.
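
As a hedged sketch of what using such a sampler can look like in practice, the snippet below swaps a DPM-style multistep solver into a Hugging Face diffusers Sana pipeline. It assumes a recent diffusers release with SanaPipeline support and a DPMSolverMultistepScheduler that accepts flow-matching sigmas; the checkpoint id is an assumption, and the repository's native Flow-DPM-Solver implementation may differ in its details.

    # Sketch: few-step sampling with a DPM-style multistep solver in diffusers.
    # Assumptions: a recent diffusers release with Sana support; the repo id
    # below is a guess at a hosted diffusers-format checkpoint; the
    # `use_flow_sigmas` flag stands in for the repository's native
    # Flow-DPM-Solver and may differ from it.
    import torch
    from diffusers import SanaPipeline, DPMSolverMultistepScheduler

    pipe = SanaPipeline.from_pretrained(
        "Efficient-Large-Model/Sana_1600M_1024px_diffusers",  # assumed repo id
        torch_dtype=torch.bfloat16,
    ).to("cuda")

    # Replace the default flow-matching scheduler with the multistep solver.
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config, use_flow_sigmas=True
    )

    image = pipe("a cyberpunk cat", num_inference_steps=20).images[0]
    image.save("sana_dpm20.png")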

Guide: Running Locally

  1. Clone Repository: Start by cloning the Sana repository from GitHub: git clone https://github.com/NVlabs/Sana.
  2. Set Up Environment: Install the necessary dependencies as specified in requirements.txt.
  3. Download Model Weights: Obtain the model weights from the Hugging Face model hub or the GitHub repository.
  4. Run Inference: Use a Python script to load the model and generate images from text prompts (see the sketch after this list).
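
A minimal inference sketch is given below, assuming the weights from step 3 have been downloaded in diffusers format to a local directory. The path, prompt, and generation settings are placeholders, and the repository also ships its own inference scripts, which may differ from this diffusers-based example.

    # Minimal local-inference sketch using Hugging Face diffusers.
    # Assumptions: a recent diffusers release with SanaPipeline; weights
    # already downloaded in diffusers format to the hypothetical path below;
    # a CUDA GPU with enough memory for bfloat16 inference.
    import torch
    from diffusers import SanaPipeline

    pipe = SanaPipeline.from_pretrained(
        "./checkpoints/sana-1600m-1024px",  # hypothetical local path from step 3
        torch_dtype=torch.bfloat16,
    ).to("cuda")

    generator = torch.Generator("cuda").manual_seed(42)
    image = pipe(
        prompt="an astronaut sketching on a whiteboard, watercolor style",
        height=1024,
        width=1024,
        guidance_scale=4.5,
        num_inference_steps=20,
        generator=generator,
    ).images[0]
    image.save("sana_sample.png")

The same script works with a Hugging Face Hub model id in place of the local path if you prefer to let diffusers download the weights automatically.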

Suggestion: For optimal performance, consider using cloud GPU services, such as those offered by AWS, Google Cloud, or Azure.

License

Sana is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). This license allows for adaptation and sharing under similar terms, provided it is not used for commercial purposes.
