Sana_1600M_2Kpx_BF16

Efficient-Large-Model

Introduction

Sana is a text-to-image framework designed for efficient image generation at resolutions up to 4096 × 4096. It enables high-resolution, high-quality image synthesis with strong text-image alignment and can run on a laptop GPU. The source code is available on GitHub.

Architecture

Sana employs a linear diffusion transformer (Linear DiT) with 1648M parameters, optimized for 2Kpx image generation. It pairs a pretrained text encoder (Gemma2-2B-IT) with a deep-compression autoencoder (DC-AE) that encodes images into spatially compressed latents. The model is fine-tuned from a base checkpoint and supports mixed prompts in emoji, Chinese, and English.
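As a rough illustration of the spatial compression step: assuming DC-AE's 32× downsampling factor (taken from the DC-AE design, not stated in this card), the latent grid the diffusion transformer operates on is small even at 2Kpx.

```python
def dcae_latent_hw(height: int, width: int, factor: int = 32):
    """Spatial size of the DC-AE latent grid for a given image size.

    The 32x downsampling factor is an assumption based on the
    deep-compression autoencoder design, not on this model card.
    """
    if height % factor or width % factor:
        raise ValueError("image sides must be multiples of the compression factor")
    return height // factor, width // factor

# A 2048 x 2048 image maps to a 64 x 64 latent grid.
print(dcae_latent_hw(2048, 2048))
```

This compact latent grid is what makes high-resolution generation tractable: the transformer's cost grows with the number of latent tokens, not raw pixels.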

Training

Training details are not specified in this card, but the model can be further fine-tuned for enhanced capabilities. Known limitations include complex scene composition and the rendering of realistic human features such as hands.

Guide: Running Locally

  1. Setup Environment: Ensure you have PyTorch and required libraries installed.
  2. Clone Repository: Obtain the code from GitHub.
  3. Prepare Model: Load the model using the provided .pth file.
  4. Run Inference: Utilize the SanaPipeline to generate images from text prompts.
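The inference step above can be sketched as follows. This is a hedged sketch: the diffusers model id, the default sampler settings, and the 32-pixel resolution constraint are assumptions, not confirmed by this card.

```python
def sana_generation_kwargs(prompt, height=2048, width=2048,
                           guidance_scale=4.5, num_inference_steps=20):
    """Collect keyword arguments for a SanaPipeline call.

    Default guidance scale and step count are illustrative assumptions.
    Height and width are kept at multiples of 32 on the assumption that
    DC-AE compresses the image 32x spatially.
    """
    if height % 32 or width % 32:
        raise ValueError("height and width must be multiples of 32")
    return dict(prompt=prompt, height=height, width=width,
                guidance_scale=guidance_scale,
                num_inference_steps=num_inference_steps)


def run_inference(kwargs,
                  model_id="Efficient-Large-Model/Sana_1600M_2Kpx_BF16_diffusers"):
    """Heavy path: requires a CUDA GPU and downloads the checkpoint.

    The model id above is an assumption about the diffusers-format repo name.
    """
    import torch
    from diffusers import SanaPipeline  # available in recent diffusers releases
    pipe = SanaPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    pipe.to("cuda")
    return pipe(**kwargs).images[0]


kwargs = sana_generation_kwargs("a cyberpunk cityscape at dusk")
# image = run_inference(kwargs)  # uncomment on a machine with a CUDA GPU
# image.save("sana.png")
```

Splitting argument construction from the pipeline call keeps the heavy GPU path out of the way until the environment is ready.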

Suggested Cloud GPUs

For optimal performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure with CUDA support.
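Before provisioning a cloud instance, a quick local check can confirm whether a CUDA-capable setup is already present. This is a best-effort heuristic (probing for `nvidia-smi` on the PATH), not a definitive capability test.

```python
import shutil


def cuda_toolchain_present() -> bool:
    """Best-effort check: nvidia-smi on PATH suggests an NVIDIA driver
    is installed, hinting at a CUDA-capable environment.

    This does not verify GPU memory or compute capability; it is a
    convenience heuristic, not part of the Sana project itself.
    """
    return shutil.which("nvidia-smi") is not None


print("CUDA toolchain detected:", cuda_toolchain_present())
```

If this returns False on your laptop, a cloud GPU instance with CUDA support is the practical route.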

License

Sana is distributed under the CC BY-NC-SA 4.0 License, which allows for non-commercial use with attribution and share-alike conditions.
