CogVideoX-5B
THUDM/CogVideoX-5B
Introduction
CogVideoX-5B is a text-to-video generation model developed by the Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University (THUDM). Its weights are openly released, and it provides high-quality video generation and visual effects. The model supports English prompts and produces videos at a resolution of 720x480 pixels.
Architecture
CogVideoX-5B uses a diffusion architecture comprising a text encoder, a Transformer, and a Variational Autoencoder (VAE). Inference is supported in BF16, FP16, and FP32 precision, and VRAM usage can be reduced through diffusers optimizations and quantization techniques.
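All of these components ship in a single diffusers pipeline and can be inspected directly. The following is a minimal sketch (using the same CogVideoXPipeline API as the guide below) that loads the model in a chosen precision and lists the three modules:
import torch
from diffusers import CogVideoXPipeline

# Load the pipeline in the desired inference precision; BF16 is recommended,
# FP16 and FP32 are also supported.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,  # or torch.float16 / torch.float32
)

# The three components described above are exposed as pipeline attributes.
print(type(pipe.text_encoder).__name__)  # text encoder
print(type(pipe.transformer).__name__)   # diffusion Transformer
print(type(pipe.vae).__name__)           # Variational Autoencoder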
Training
The model is trained in BF16 precision, which is also the recommended precision for inference. Fine-tuning is supported via LoRA and SFT, both of which require substantial VRAM. Prompts are limited to 226 tokens, and generated videos are 6 seconds long at 8 frames per second.
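To check a prompt against the 226-token limit before generation, the pipeline's tokenizer can be loaded on its own. This is a sketch that assumes the standard diffusers repository layout, where the tokenizer lives in a "tokenizer" subfolder:
from transformers import AutoTokenizer

# Load only the T5 tokenizer from the model repository; the "tokenizer"
# subfolder name is assumed from the usual diffusers layout.
tokenizer = AutoTokenizer.from_pretrained("THUDM/CogVideoX-5b", subfolder="tokenizer")

prompt = "A panda, dressed in a small, red jacket, sits in a bamboo forest"
num_tokens = len(tokenizer(prompt).input_ids)
print(f"{num_tokens} tokens (limit: 226)")

# 49 frames at 8 fps is roughly 6 seconds of video (49 / 8 ~ 6.1 s),
# which is where the num_frames=49 in the guide below comes from.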
Guide: Running Locally
- Install Dependencies:
pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
- Run the Code:
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest..."

# Load the pipeline in BF16, the recommended inference precision.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
)

# Reduce peak VRAM usage.
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

# Generate 49 frames (~6 seconds at 8 fps) with a fixed seed for reproducibility.
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
- Use Cloud GPUs:
Consider cloud GPUs such as the NVIDIA A100 or H100 for large-scale or time-sensitive workloads. Smaller cards can still work with the additional memory-saving options sketched below.
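These options are exposed on the diffusers pipeline itself. The following is a minimal sketch of how they combine, as a starting point rather than a definitive low-VRAM recipe:
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)

# Sequential offload moves one submodule at a time to the GPU: slower than
# enable_model_cpu_offload(), but with a much lower VRAM peak. Use one
# offload mode or the other, not both.
pipe.enable_sequential_cpu_offload()

# Slice and tile VAE decoding to cap decoder memory at the cost of speed.
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()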
License
The model is released under the CogVideoX LICENSE. For more details, refer to the license document.