Introduction

VGen is an open-source video synthesis codebase developed by Alibaba's Tongyi Lab, featuring advanced video generative models. It supports multiple video synthesis methods, including I2VGen-XL and VideoComposer, and provides a suite of tools for generating high-quality videos from text, images, and other inputs.

Architecture

VGen's architecture is designed to be expandable and complete: it ships with powerful pre-trained models for a range of video generation tasks, and its modular codebase makes experiments easy to manage by composing registered components such as ENGINE, MODEL, and DATASETS.
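
As a minimal sketch of this modular pattern, components can be registered once and then instantiated from a configuration dictionary. The Registry class and the DemoUNet name below are illustrative assumptions, not VGen's actual implementation:

    # Illustrative component registry; VGen's real ENGINE/MODEL/DATASETS
    # registries live in the codebase and differ in detail.
    class Registry:
        def __init__(self, name):
            self.name = name
            self._modules = {}

        def register(self, cls):
            # Store the class under its own name so configs can refer to it.
            self._modules[cls.__name__] = cls
            return cls

        def build(self, cfg):
            cfg = dict(cfg)  # copy so the caller's config is not mutated
            cls = self._modules[cfg.pop('type')]
            return cls(**cfg)

    MODEL = Registry('MODEL')

    @MODEL.register
    class DemoUNet:  # hypothetical model component
        def __init__(self, in_channels=4):
            self.in_channels = in_channels

    # A config dict selects and parameterizes the component:
    model = MODEL.build({'type': 'DemoUNet', 'in_channels': 4})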

Training

To train a text-to-video model with VGen, launch the distributed training entry point with a configuration file such as t2v_train.yaml, which controls the data pipeline and diffusion settings. Training can be initialized from pre-trained models, checkpoints and results are saved for review, and a separate inference step generates videos from the trained model.
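
For example, a variant configuration can be derived from the shipped t2v_train.yaml before launching training. The keys overridden below (resolution, pretrained_checkpoint) are hypothetical placeholders; the real schema is whatever configs/t2v_train.yaml defines:

    import yaml  # PyYAML (pip install pyyaml if not already present)

    # Load the shipped config and override a few settings; the key names
    # here are assumptions for illustration, not VGen's documented schema.
    with open('configs/t2v_train.yaml') as f:
        cfg = yaml.safe_load(f)

    cfg['resolution'] = [448, 256]
    cfg['pretrained_checkpoint'] = 'models/t2v_checkpoint.pth'

    with open('configs/t2v_train_custom.yaml', 'w') as f:
        yaml.safe_dump(cfg, f)

    # Then train with the derived config:
    #   python train_net.py --cfg configs/t2v_train_custom.yaml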

Guide: Running Locally

  1. Clone Repository:

    git clone https://github.com/damo-vilab/i2vgen-xl.git
    cd i2vgen-xl
    
  2. Installation (run inside the cloned repository so requirements.txt is available):

    conda create -n vgen python=3.8
    conda activate vgen
    pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113
    pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
    
  3. Dataset: Use the demo dataset provided with the repository for testing (a quick inspection sketch follows this list).

  4. Training:

    python train_net.py --cfg configs/t2v_train.yaml
    
  5. Inference:

    python inference.py --cfg configs/t2v_infer.yaml
    
  6. Running I2VGen-XL:

    • Install modelscope, then download the model and test data from Python:

      pip install modelscope

      # Downloads the I2VGen-XL model and test data into models/
      from modelscope.hub.snapshot_download import snapshot_download
      model_dir = snapshot_download('damo/I2VGen-XL', cache_dir='models/', revision='v1.0.0')
      
    • Execute:
      python inference.py --cfg configs/i2vgen_xl_infer.yaml
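
As mentioned in step 3, the snippet below is one quick way to sanity-check the demo dataset after cloning. The data/ path is an assumption about the repository layout; point it at wherever the demo videos and captions actually live:

    from pathlib import Path

    # 'data' is an assumed demo-dataset location within the cloned repo.
    data_dir = Path('data')
    for p in sorted(data_dir.rglob('*')):
        if p.suffix in {'.mp4', '.jpg', '.png', '.txt'}:
            print(p)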
      

Cloud GPUs: Training and inference are GPU-intensive; for best performance, cloud GPUs such as those from AWS or Google Cloud are recommended.

License

The project is licensed under the MIT License. The model is intended for research and non-commercial use only, and was trained on datasets such as WebVid-10M and LAION-400M.
