Mochi 1 Preview

Genmo

Introduction

Mochi 1 is an advanced open-source video generation model developed by Genmo. It features high-fidelity motion and strong prompt adherence, representing a significant improvement in open video generation systems. The model is available under the Apache 2.0 license and can be freely experimented with on Genmo's playground.

Architecture

Mochi 1 is built on the Asymmetric Diffusion Transformer (AsymmDiT) architecture with 10 billion parameters, making it one of the largest openly released video generation models at the time of its release. It ships with an inference harness that provides an efficient context-parallel implementation. For video compression, the model pairs the transformer with AsymmVAE, an asymmetric encoder-decoder variational autoencoder.

AsymmVAE Model Specs

  • Params Count: 362M
  • Enc Base Channels: 64
  • Dec Base Channels: 128
  • Latent Dim: 12
  • Spatial Compression: 8x8
  • Temporal Compression: 6x
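The compression factors above determine the latent tensor a video is encoded into. The sketch below works this out for an example clip; the resolution and frame count (848x480, 163 frames) and the treat-the-first-frame-separately convention are illustrative assumptions, not part of the spec table.

```python
# Sketch: latent-tensor shape implied by the AsymmVAE specs above.
def latent_shape(frames, height, width,
                 spatial=8, temporal=6, channels=12):
    """Map a pixel-space video to its compressed latent shape.
    Causal video VAEs commonly keep the first frame uncompressed
    in time, hence the (frames - 1) term (an assumption here)."""
    t = (frames - 1) // temporal + 1
    return (channels, t, height // spatial, width // spatial)

print(latent_shape(163, 480, 848))  # (12, 28, 60, 106)
```

So an 848x480, 163-frame clip compresses to a 12-channel latent of 28 frames at 106x60, per the 8x8 spatial and 6x temporal factors listed above.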

AsymmDiT Model Specs

  • Params Count: 10B
  • Num Layers: 48
  • Num Heads: 24
  • Visual Dim: 3072
  • Text Dim: 1536
  • Visual Tokens: 44520
  • Text Tokens: 256
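The 44,520 visual-token figure above is consistent with patchifying the AsymmVAE latent. The arithmetic sketch below reproduces it assuming an 848x480, 163-frame clip and 2x2 patchification in the DiT; the clip size and patch size are assumptions, not stated in the spec tables.

```python
# Arithmetic sketch: how 44,520 visual tokens can arise from the
# AsymmVAE compression factors (8x8 spatial, 6x temporal) plus an
# assumed 2x2 patchify step in the transformer.
def visual_tokens(frames, height, width,
                  spatial=8, temporal=6, patch=2):
    lat_t = (frames - 1) // temporal + 1   # latent frames
    lat_h = height // spatial              # latent height
    lat_w = width // spatial               # latent width
    return lat_t * (lat_h // patch) * (lat_w // patch)

print(visual_tokens(163, 480, 848))  # 44520
```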

Training

Mochi 1 encodes prompts with a single T5-XXL language model, unlike many modern diffusion models that combine several pretrained text encoders. The architecture's asymmetric design reduces memory use by giving the text stream a smaller hidden dimension than the visual stream, while joint multi-modal self-attention lets text and visual tokens attend to each other.
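A minimal numpy sketch of that asymmetric joint-attention idea: text and visual tokens keep different hidden widths (1536 vs 3072, per the spec tables above), each modality has its own projections into a shared attention space, and a single softmax attention runs over the concatenated sequence. The head dimension, token counts, and random initialisation are illustrative assumptions, not Mochi's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_txt, d_head = 3072, 1536, 128
n_vis, n_txt = 16, 4          # tiny token counts for the sketch

vis = rng.standard_normal((n_vis, d_vis))
txt = rng.standard_normal((n_txt, d_txt))

# Separate QKV projections per modality (the "asymmetric" part).
Wq_v, Wk_v, Wv_v = (rng.standard_normal((d_vis, d_head)) for _ in range(3))
Wq_t, Wk_t, Wv_t = (rng.standard_normal((d_txt, d_head)) for _ in range(3))

# Concatenate both streams into one joint sequence.
q = np.concatenate([vis @ Wq_v, txt @ Wq_t])
k = np.concatenate([vis @ Wk_v, txt @ Wk_t])
v = np.concatenate([vis @ Wv_v, txt @ Wv_t])

# One joint softmax attention: every token attends to both modalities.
scores = q @ k.T / np.sqrt(d_head)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ v

print(out.shape)  # (20, 128)
```

The design choice this illustrates: the narrow text stream saves parameters and memory, while the joint attention still gives prompts direct influence over every visual token.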

Guide: Running Locally

To run Mochi 1 locally, follow these steps:

  1. Installation: Clone the repository and set up a virtual environment using uv.

    git clone https://github.com/genmoai/models
    cd models
    pip install uv
    uv venv .venv
    source .venv/bin/activate
    uv pip install setuptools
    uv pip install -e . --no-build-isolation
    
  2. Download Weights: Use download_weights.py to download the model and decoder.

    python3 ./scripts/download_weights.py <path_to_downloaded_directory>
    
  3. Run the Model:

    • Start the Gradio UI:
      python3 ./demos/gradio_ui.py --model_dir "<path_to_downloaded_directory>"
      
    • Or generate videos directly:
      python3 ./demos/cli.py --model_dir "<path_to_downloaded_directory>"
      
  4. Cloud GPUs: For optimal performance, at least one H100 GPU is recommended; running on a single GPU requires roughly 60 GB of VRAM.

License

Mochi 1 is released under the Apache 2.0 license, which is a permissive license allowing users to freely use, modify, and distribute the software.
