Sana_1600M_1024px_MultiLing
Efficient-Large-Model
Sana: Efficient Text-to-Image Model
Introduction
Sana is a text-to-image generation framework that can produce images at resolutions up to 4096 × 4096 pixels. It is designed for fast synthesis and can be deployed on a laptop GPU. The checkpoint described here targets 1024px output, as its name indicates.
Architecture
The model is a Linear Diffusion Transformer (a diffusion transformer that uses linear attention) with 1648M parameters. It uses the Gemma2-2B-IT text encoder and DC-AE, a latent feature encoder with 32x spatial compression. It supports multilingual prompts, including English and Chinese, and can interpret emoji.
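To make the effect of 32x spatial compression concrete, the following sketch counts how many latent positions (tokens) the transformer has to process for a square image. It is illustrative arithmetic only, not code from the Sana repository. The helper `latent_tokens`, the patch size of 1 for Sana, and the 8x autoencoder with 2x2 patches used as a comparison point are all assumptions introduced for this example.

```python
def latent_tokens(image_px: int, compression: int, patch_size: int = 1) -> int:
    """Return the number of tokens for a square image of side `image_px`.

    `compression` is the autoencoder's spatial downsampling factor.
    `patch_size` is the transformer's patchify factor (an assumption here;
    the text above only states the 32x compression of DC-AE).
    """
    stride = compression * patch_size
    if image_px % stride != 0:
        raise ValueError("image size must be divisible by compression * patch_size")
    side = image_px // stride
    return side * side


# 32x compression as stated above, with an assumed patch size of 1.
sana_1024 = latent_tokens(1024, compression=32)

# Hypothetical comparison point: an 8x autoencoder with 2x2 patches.
baseline_1024 = latent_tokens(1024, compression=8, patch_size=2)

# Prints: 1024 4096 4
print(sana_1024, baseline_1024, baseline_1024 // sana_1024)
```

Under these assumptions a 1024px image becomes a 32 × 32 latent grid, a quarter of the tokens of the hypothetical baseline, and a 4096px image becomes a 128 × 128 grid (16384 tokens). Fewer tokens is one reason high-resolution generation stays tractable.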
Training
This checkpoint is fine-tuned from the base model, Efficient-Large-Model/Sana_1600M_1024px, to improve its handling of multilingual prompts and of prompts that mix languages and emoji. It emphasizes efficient image synthesis with strong text-image alignment.
Guide: Running Locally
- Clone the Repository:
git clone https://github.com/NVlabs/Sana
- Install Dependencies:
Change into the repository directory and install the required Python packages.
cd Sana
pip install -r requirements.txt
- Run the Model:
Execute the model script to generate images from text prompts.
python run_sana.py --prompt "Your text prompt here"
- Cloud GPUs: For enhanced performance, consider using cloud services like AWS or Google Cloud to access powerful GPUs.
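The run step above can be wrapped in a small script to generate images for several prompts in turn, for example to try English, Chinese, and emoji prompts side by side. This is a minimal sketch that assumes the `run_sana.py --prompt` interface shown in the steps above and that it is started from the repository root. The helpers `build_command` and `run_prompts` are hypothetical names introduced here, not part of the Sana repository.

```python
import shlex
import subprocess
import sys
from typing import List


def build_command(prompt: str, script: str = "run_sana.py") -> List[str]:
    """Build the argument list for one generation run.

    Assumes the `run_sana.py --prompt "<text>"` interface from the guide.
    """
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    # Passing a list (not a shell string) keeps quotes, emoji, and
    # non-ASCII text in the prompt intact without manual escaping.
    return [sys.executable, script, "--prompt", prompt]


def run_prompts(prompts: List[str], dry_run: bool = True) -> List[List[str]]:
    """Build one command per prompt; execute them unless dry_run is True."""
    commands = [build_command(p) for p in prompts]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raises if a run fails
    return commands


if __name__ == "__main__":
    example_prompts = [
        "a cat wearing sunglasses",
        "一只戴着墨镜的猫",
        "a 🐱 riding a 🚲",
    ]
    # Dry run: only show the commands that would be executed.
    for cmd in run_prompts(example_prompts, dry_run=True):
        print(shlex.join(cmd))
```

Calling `run_prompts(example_prompts, dry_run=False)` would execute each command in sequence and stop at the first failure.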
License
Sana is released under the CC BY-NC-SA 4.0 License, which permits non-commercial use, adaptation, and sharing, provided that attribution is given and derivative works are distributed under the same license. For more details, refer to the license file.