Introduction

Allegro, developed by Rhymes AI, is a text-to-video generation model designed to produce high-quality video content. It is open source and released under the Apache 2.0 license.

Architecture

Allegro combines a 175M-parameter VideoVAE with a 2.8B-parameter VideoDiT model. It supports multiple precisions (FP32, BF16, FP16) and runs in as little as 9.3 GB of GPU memory in BF16 with CPU offloading. The model generates 6-second videos at 15 FPS and 720x1280 resolution, which can be interpolated to 30 FPS with EMA-VFI.

Training

Allegro is trained to handle a variety of content, including dynamic scenes and close-ups of humans and animals. The model uses a large context length of 79.2K tokens, corresponding to 88 video frames, to produce detailed and versatile video outputs.
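
As a rough consistency check on these numbers (a back-of-the-envelope calculation, not taken from the model card), 88 frames at the native 15 FPS works out to roughly the 6-second clips described above:

    # Rough duration check (assumes generated frames map one-to-one to output frames)
    frames = 88
    fps = 15
    print(frames / fps)  # ~5.87 seconds, i.e. the ~6-second clips Allegro produces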

Guide: Running Locally

  1. Install Requirements:
    Ensure Python >= 3.10, PyTorch >= 2.4, and CUDA >= 12.4 are installed. Use Anaconda to create and activate a new environment:

    conda create -n allegro python=3.10 -y
    conda activate allegro
    

    Install necessary packages:

    pip install git+https://github.com/huggingface/diffusers.git torch==2.4.1 transformers==4.40.1 accelerate sentencepiece imageio imageio-ffmpeg beautifulsoup4
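
    Optionally, verify the environment before moving on (a quick sanity check, not part of the official instructions):

    import torch
    print(torch.__version__, torch.version.cuda, torch.cuda.is_available())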
    
  2. Run Inference:
    Import necessary modules and load the model:

    import torch
    from diffusers import AutoencoderKLAllegro, AllegroPipeline
    from diffusers.utils import export_to_video
    
    # The VAE is loaded in FP32; the rest of the pipeline runs in BF16
    vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro", subfolder="vae", torch_dtype=torch.float32)
    pipe = AllegroPipeline.from_pretrained("rhymes-ai/Allegro", vae=vae, torch_dtype=torch.bfloat16)
    pipe.to("cuda")
    # Tiled VAE decoding lowers peak memory when decoding the video latents
    pipe.vae.enable_tiling()
    
    prompt = "A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water."
    video = pipe(prompt, guidance_scale=7.5, max_sequence_length=512, num_inference_steps=100).frames[0]
    export_to_video(video, "output.mp4", fps=15)
    

    For reduced GPU memory usage, use pipe.enable_sequential_cpu_offload(), though this increases inference time.
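
    A minimal sketch of this memory-saving variant (with sequential offload the pipeline is generally not moved to CUDA via pipe.to("cuda"); submodules are streamed to the GPU on demand):

    pipe = AllegroPipeline.from_pretrained("rhymes-ai/Allegro", vae=vae, torch_dtype=torch.bfloat16)
    # Stream submodules to the GPU only when needed (slower, about 9.3 GB of VRAM in BF16)
    pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_tiling()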

  3. Interpolate Video:
    Use EMA-VFI to interpolate the 15 FPS output to 30 FPS for smoother motion.
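
    If EMA-VFI is not set up, ffmpeg's minterpolate filter is a quick, lower-quality substitute for frame interpolation (an alternative suggestion, not the tool recommended by the authors):

    ffmpeg -i output.mp4 -vf "minterpolate=fps=30" output_30fps.mp4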

  4. Faster Inference:
    Explore options such as Context Parallel and PAB in the Allegro GitHub repository for faster inference.

Cloud GPUs: Consider cloud GPU services such as AWS, Google Cloud, or Azure if suitable local hardware is not available.

License

This project is licensed under the Apache 2.0 License.
