Introduction

Tora is a trajectory-oriented diffusion transformer framework designed for video generation. It incorporates textual, visual, and trajectory conditions, enabling the creation of videos with controllable motion. The framework includes a Trajectory Extractor, a Spatial-Temporal DiT, and a Motion-guidance Fuser, facilitating the integration of motion patches to produce videos with high motion fidelity.

Architecture

Tora's architecture comprises three main components:

  • Trajectory Extractor (TE): Encodes trajectories into hierarchical spacetime motion patches using a 3D video compression network.
  • Spatial-Temporal DiT: The diffusion transformer backbone that generates the video and processes the encoded motion patches.
  • Motion-guidance Fuser (MGF): Integrates motion patches into DiT blocks to ensure consistent video generation following the specified trajectories.
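
The data flow between the three components can be illustrated with a deliberately tiny sketch. Everything in it is a hypothetical stand-in: `extract_motion_patches` averages displacements instead of using a 3D video compression network, the DiT block is reduced to a placeholder, and the scale-and-shift rule in `fuse` is only an assumed fusion rule, not necessarily the one Tora uses. None of these names come from the Tora codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class MotionPatch:
    """Toy stand-in for one encoded motion patch (mean displacement over a frame window)."""
    dx: float
    dy: float


def extract_motion_patches(trajectory: List[Point], window: int) -> List[MotionPatch]:
    """Toy Trajectory Extractor: compress a trajectory along time.

    The real TE uses a 3D video compression network; this only averages the
    frame-to-frame displacement inside each window of `window` steps.
    """
    steps = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:])]
    patches = []
    for start in range(0, len(steps), window):
        chunk = steps[start:start + window]
        patches.append(MotionPatch(
            dx=sum(s[0] for s in chunk) / len(chunk),
            dy=sum(s[1] for s in chunk) / len(chunk),
        ))
    return patches


def fuse(token: float, patch: MotionPatch) -> float:
    """Toy Motion-guidance Fuser: scale-and-shift modulation (an assumed rule)."""
    return token * (1.0 + patch.dx) + patch.dy


def run_blocks(tokens: List[float], patches: List[MotionPatch], num_blocks: int) -> List[float]:
    """Apply the fuser once per block; the attention block itself is omitted."""
    for _ in range(num_blocks):
        # A real Spatial-Temporal DiT block would run attention here.
        tokens = [fuse(t, p) for t, p in zip(tokens, patches)]
    return tokens


if __name__ == "__main__":
    trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 2.0), (4.0, 4.0)]
    patches = extract_motion_patches(trajectory, window=2)
    print(run_blocks([1.0, 1.0], patches, num_blocks=1))
```

The point of the sketch is only the ordering: the trajectory is encoded once, and the resulting patches are injected into every block rather than just at the input.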

This design scales well and gives precise control over video dynamics across varied durations, aspect ratios, and resolutions.

Training

The Tora framework, including its text-to-video training code, has been publicly released. It builds on Diffusion Transformers to improve motion fidelity in video generation. The released implementation has also been optimized to reduce VRAM requirements and improve inference speed.

Guide: Running Locally

To run Tora locally, follow these steps:

  1. Clone the Repository:
    git clone https://github.com/alibaba/Tora
    cd Tora
    
  2. Install Dependencies: Set up the required environment by installing dependencies listed in the requirements.txt file:
    pip install -r requirements.txt
    
  3. Download Model Weights: Obtain the model weights from the Hugging Face or ModelScope repositories.
  4. Run Inference: Follow the instructions in the README.md or the dedicated inference guide to execute the model and generate videos.
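
Steps 1 to 3 can be sanity-checked and partly automated with a short script. The following is a minimal sketch, not part of the Tora repository: `missing_files` and `download_weights` are hypothetical helper names, the script assumes the `huggingface_hub` package is installed separately, and the repository ID for the weights is not given in this guide, so it must be looked up on the project's page and passed in.

```python
from pathlib import Path
from typing import List

# Files this guide refers to; their presence suggests a complete checkout.
EXPECTED_FILES = ("requirements.txt", "README.md")


def missing_files(repo_dir: Path) -> List[str]:
    """Return the expected files that are absent from the cloned repository."""
    return [name for name in EXPECTED_FILES if not (repo_dir / name).is_file()]


def download_weights(repo_id: str, target_dir: Path) -> Path:
    """Download a model repository from the Hugging Face Hub into target_dir.

    repo_id is NOT specified in this guide; take it from the project's page.
    """
    # Imported lazily so missing_files works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=repo_id, local_dir=str(target_dir)))


if __name__ == "__main__":
    problems = missing_files(Path("Tora"))
    print("Missing files:", problems or "none")
```

Once the checkout looks complete, calling `download_weights` with the correct repository ID and a target directory fetches the weights; where the inference scripts expect them is described in the README.md.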

Cloud GPUs

For optimal performance, especially for large-scale video generation tasks, consider using cloud-based GPU solutions, such as AWS EC2 instances with NVIDIA A100 or V100 GPUs.
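
Before launching a job on a local or cloud machine, it can help to confirm which GPUs are visible. The sketch below is one possible check, assuming the NVIDIA driver's `nvidia-smi` tool is on the PATH; `parse_gpu_lines` and `list_gpus` are hypothetical helper names and are not part of Tora.

```python
import shutil
import subprocess
from typing import List, Tuple


def parse_gpu_lines(text: str) -> List[Tuple[str, str]]:
    """Parse 'name, memory' CSV lines as printed by nvidia-smi."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, memory = line.partition(",")
        gpus.append((name.strip(), memory.strip()))
    return gpus


def list_gpus() -> List[Tuple[str, str]]:
    """Return (name, total memory) for each visible NVIDIA GPU, or [] if none."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_lines(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    for name, memory in list_gpus():
        print(f"{name}: {memory}")
```

An empty result means no NVIDIA GPU was detected, in which case a cloud instance is likely the more practical option.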

License

The license for the Tora framework is listed as "other" rather than as a standard license identifier. Refer to the project's repository for the full licensing terms and conditions of use.
