CogVideoX1.5-5B

THUDM

Introduction

CogVideoX is an open-source video generation model developed by the Knowledge Engineering Group (KEG) at Tsinghua University. It generates video from text prompts using a diffusion model. It can run inference at several numeric precisions and accepts prompts in English.

Architecture

CogVideoX1.5-5B generates videos at a resolution of 1360x768, 5 to 10 seconds long, at 16 frames per second. Single-GPU inference at BF16 precision requires at least 9GB of GPU memory; multi-GPU setups require 24GB. Inference speed depends on hardware: a single NVIDIA A100 takes approximately 1000 seconds to generate a 5-second video.
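The relationship between clip length and frame count can be worked out with simple arithmetic. The sketch below assumes the frame count is one initial frame plus 16 frames for each second of video, which is consistent with the num_frames=81 used for a 5-second clip in the run example in this guide. The helper name frames_for_duration is hypothetical and not part of any library.

```python
FPS = 16  # frame rate stated for CogVideoX1.5-5B


def frames_for_duration(seconds: int, fps: int = FPS) -> int:
    """Return the frame count for a clip of the given length.

    Assumed rule (not confirmed by this document): one initial frame
    plus `fps` frames for every second of video.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return seconds * fps + 1


# Frame counts for the shortest and longest supported clip lengths.
for seconds in (5, 10):
    print(f"{seconds} s -> {frames_for_duration(seconds)} frames")
```

Under this assumption, a 5-second clip needs 81 frames and a 10-second clip needs 161.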

Training

The model supports several optimizations that reduce memory use on GPUs with the NVIDIA Ampere architecture or newer. These include sequential CPU offloading and VAE tiling and slicing. Quantization with PyTorchAO (torchao) or Optimum-quanto can reduce memory requirements further, making it feasible to run the model on GPUs with less VRAM.
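To see why quantization helps, a back-of-envelope estimate of weight storage at different precisions is useful. The sketch below assumes roughly 5 billion parameters (inferred from the "5B" in the model name, not an exact count) and counts weights only. It ignores activations, the text encoder, and the VAE, and with CPU offloading not all weights sit on the GPU at once, so these numbers do not map directly onto the memory figures quoted above.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}

# Assumption: about 5 billion parameters, taken from the "5B" in the name.
ASSUMED_PARAMS = 5e9


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Estimate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in ("fp32", "bf16", "int8"):
    size = weight_memory_gb(ASSUMED_PARAMS, precision)
    print(f"{precision}: about {size:.1f} GB of weights")
```

Going from BF16 to INT8 halves the weight storage in this estimate, which is the main reason quantized variants fit on smaller GPUs.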

Guide: Running Locally

To run CogVideoX locally, follow these steps:

  1. Install Dependencies:

    pip install git+https://github.com/huggingface/diffusers
    pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
    
  2. Run the Code:

    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video
    
    prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest..."
    
    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX1.5-5B",
        torch_dtype=torch.bfloat16
    )
    
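    # Trade speed for lower GPU memory: keep weights on the CPU and move
    # them to the GPU only when needed, and let the VAE work in tiles and slices.
    pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_tiling()
    pipe.vae.enable_slicing()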
    
    video = pipe(
        prompt=prompt,
        num_videos_per_prompt=1,
        num_inference_steps=50,
        num_frames=81,
        guidance_scale=6,
        generator=torch.Generator(device="cuda").manual_seed(42),
    ).frames[0]
    
    # 16 fps matches the model's stated frame rate, so 81 frames play back in about 5 seconds.
    export_to_video(video, "output.mp4", fps=16)
    

Cloud GPUs such as the NVIDIA A100 or H100 are recommended for best performance.
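Before running the script, it can help to confirm that the packages from step 1 are importable in the current environment. The following is a minimal sketch using only the standard library. The helper name missing_modules is hypothetical, and the list uses import names (for example imageio_ffmpeg for the imageio-ffmpeg package) together with torch, which the script imports directly.

```python
from importlib.util import find_spec

# Import names for the packages the run script relies on.
REQUIRED_MODULES = [
    "torch",
    "diffusers",
    "transformers",
    "accelerate",
    "imageio_ffmpeg",
]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

This only checks that each module can be located. It does not verify versions or GPU availability.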

License

This model is released under the CogVideoX LICENSE.
