PixArt-XL-2-1024-MS
PixArt-α Model Documentation
Introduction
PixArt-α is a diffusion-transformer-based text-to-image generative model that produces high-resolution images from textual prompts. Designed for efficiency, it requires significantly less training time than comparable models while maintaining competitive performance.
Architecture
PixArt-α utilizes pure transformer blocks within a latent diffusion framework, allowing it to generate 1024px images in a single sampling process. Key components include:
- Text Encoder: Pretrained T5 model.
- Latent Feature Encoder: Variational Autoencoder (VAE).
- Transformer Latent Diffusion Model: Facilitates the image generation process.
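The three components above can be wired together conceptually as encode-denoise-decode. The following is a toy sketch of that latent-diffusion flow using plain-Python stand-ins; none of these functions are the real models or the diffusers API, they only illustrate how data moves through the pipeline:

```python
import random

# Toy stand-ins (NOT the real models) illustrating the PixArt-α data flow.
def t5_encode(prompt):                 # Text Encoder: pretrained T5
    return [float(ord(c)) for c in prompt[:8]]

def sample_noise(dim=4):               # generation starts from Gaussian noise in latent space
    return [random.gauss(0, 1) for _ in range(dim)]

def transformer(latents, t, text_emb): # transformer predicts the noise at step t
    return [x * 0.1 for x in latents]

def scheduler_step(latents, noise_pred, t):  # scheduler removes the predicted noise
    return [x - n for x, n in zip(latents, noise_pred)]

def vae_decode(latents):               # VAE maps latents back to pixel values
    return [int(abs(x) * 255) % 256 for x in latents]

def generate(prompt, num_steps=20):
    text_emb = t5_encode(prompt)
    latents = sample_noise()
    for t in reversed(range(num_steps)):
        latents = scheduler_step(latents, transformer(latents, t, text_emb), t)
    return vae_decode(latents)
```

In the real pipeline the transformer is conditioned on the T5 text embeddings at every step, and the VAE decodes the final latents into a 1024px image in a single sampling process.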
Training
The model achieves impressive efficiency, taking only 10.8% of the time required by Stable Diffusion v1.5, with substantial cost and CO2 savings. Training uses 675 A100 GPU days compared to the 6,250 required by Stable Diffusion v1.5.
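As a quick sanity check on those figures:

```python
# GPU-day figures quoted above
pixart_a100_days = 675
sd15_a100_days = 6250

# 675 / 6250 = 0.108, matching the 10.8% figure in the text
ratio = pixart_a100_days / sd15_a100_days
print(f"{ratio:.1%}")  # → 10.8%
```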
Guide: Running Locally
To run PixArt-α on a local machine, follow these steps:
- Install Required Libraries:
pip install -U diffusers transformers accelerate safetensors sentencepiece
- Load the Model:
from diffusers import PixArtAlphaPipeline
import torch

pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
- Generate Images:
prompt = "An astronaut riding a green horse"
image = pipe(prompt=prompt).images[0]
image.save("astronaut.png")
- Optimize for Inference: Use torch.compile for faster inference if using torch >= 2.0.
- CPU Offloading: Enable model CPU offloading if limited by GPU VRAM.
- Cloud GPU Recommendation: Utilize cloud services such as Google Colab for free access to GPUs.
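The two optimization tips above can be sketched as a small helper. This is a hedged sketch assuming `pipe` is a loaded PixArtAlphaPipeline; `torch.compile` and `enable_model_cpu_offload` are real torch/diffusers APIs, but the helper function itself is illustrative:

```python
def optimize_pipeline(pipe, compile_transformer=True, offload=False):
    """Optionally compile the transformer and/or enable CPU offloading."""
    if compile_transformer:
        import torch  # torch >= 2.0 required for torch.compile
        # Compiling the transformer speeds up repeated sampling calls
        # (the first call pays a one-time compilation cost).
        pipe.transformer = torch.compile(
            pipe.transformer, mode="reduce-overhead", fullgraph=True
        )
    if offload:
        # Moves submodules to the GPU only while they run, trading some
        # speed for a much smaller peak VRAM footprint.
        pipe.enable_model_cpu_offload()
    return pipe
```

With a pipeline loaded as in the steps above, `pipe = optimize_pipeline(pipe, offload=True)` would apply offloading. Note that compilation and offloading pull in opposite directions (speed vs. memory), so enable whichever matches your hardware.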
License
PixArt-α is distributed under the CreativeML Open RAIL++-M License. This license allows for research and educational use, but certain uses such as generating harmful content are prohibited.