PixArt-alpha/PixArt-XL-2-1024-MS

PixArt-α Model Documentation

Introduction

PixArt-α is a diffusion-transformer-based text-to-image generative model capable of producing high-resolution images from textual prompts. The model, designed for efficiency, requires significantly less training time than comparable models, while maintaining competitive performance.

Architecture

PixArt-α utilizes pure transformer blocks within a latent diffusion framework, allowing it to generate 1024×1024 images in a single sampling process. Key components include:

  • Text Encoder: Pretrained T5 model.
  • Latent Feature Encoder: Variational Autoencoder (VAE).
  • Transformer Latent Diffusion Model: Iteratively denoises latents conditioned on the text embeddings; the VAE decoder then converts the final latents into the output image.
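The three components above are exposed as attributes on the diffusers pipeline; a minimal inspection sketch (assumes diffusers is installed, and note that loading downloads several GB of weights):

```python
from diffusers import PixArtAlphaPipeline

pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS")

print(type(pipe.text_encoder).__name__)  # T5 text encoder
print(type(pipe.vae).__name__)           # latent feature encoder (VAE)
print(type(pipe.transformer).__name__)   # transformer diffusion backbone
```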

Training

The model achieves impressive efficiency, taking only 10.8% of the time required by Stable Diffusion v1.5, with substantial cost and CO2 savings. Training uses 675 A100 GPU days compared to the 6,250 required by Stable Diffusion v1.5.
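The 10.8% figure follows directly from the quoted GPU-day counts; a quick arithmetic check:

```python
# Training-efficiency claim: 675 A100 GPU days (PixArt-α)
# versus 6,250 A100 GPU days (Stable Diffusion v1.5).
pixart_gpu_days = 675
sd15_gpu_days = 6250

fraction = pixart_gpu_days / sd15_gpu_days
print(f"{fraction:.1%}")  # → 10.8%
```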

Guide: Running Locally

To run PixArt-α on a local machine, follow these steps:

  1. Install Required Libraries:
    pip install -U diffusers transformers accelerate safetensors sentencepiece
    
  2. Load the Model:
    from diffusers import PixArtAlphaPipeline
    import torch
    
    pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    
  3. Generate Images:
    prompt = "An astronaut riding a green horse"
    image = pipe(prompt=prompt).images[0]
    image.save("astronaut.png")
    
  4. Optimize for Inference: Use torch.compile for faster inference if using torch >= 2.0.
  5. CPU Offloading: Enable model CPU offloading if limited by GPU VRAM.
  6. Cloud GPU Recommendation: If no local GPU is available, cloud services such as Google Colab offer free GPU access.
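The steps above can be combined into a single script; this is a minimal sketch assuming a CUDA GPU, diffusers with PixArt support installed, and torch >= 2.0 for the optional compile step:

```python
import torch
from diffusers import PixArtAlphaPipeline

pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16
)

# Either move the whole pipeline to the GPU...
# pipe.to("cuda")
# ...or, when constrained by GPU VRAM, offload idle sub-models to the CPU:
pipe.enable_model_cpu_offload()

# Optional: compile the transformer for faster repeated inference (torch >= 2.0).
pipe.transformer = torch.compile(pipe.transformer)

image = pipe(prompt="An astronaut riding a green horse").images[0]
image.save("astronaut.png")
```

Note that `enable_model_cpu_offload()` handles device placement itself, so it replaces the explicit `pipe.to("cuda")` call rather than complementing it.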

License

PixArt-α is distributed under the CreativeML Open RAIL++-M License. This license permits broad use of the model, including research and educational applications, but prohibits certain uses such as generating harmful content.