CogVideoX-5B

THUDM

Introduction

CogVideoX-5B is a text-to-video generation model developed by the Knowledge Engineering Group (KEG) & Data Mining group at Tsinghua University. It is released as an open-source model aimed at high-quality video generation and visual effects. The model accepts English prompts and produces video at a resolution of 720x480 pixels.

Architecture

CogVideoX-5B is a diffusion model built from three components: a text encoder, a Transformer, and a Variational Autoencoder (VAE). It supports inference in BF16, FP16, and FP32 precision, and its VRAM usage can be reduced with the memory optimizations in the diffusers library and with quantization.
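To make the precision trade-off concrete, the sketch below estimates how much memory the weights alone would occupy at each supported precision. It assumes a nominal 5 billion parameters (taken from the "5B" in the model name, not an exact count). It ignores activations, the text encoder, and the VAE, so real VRAM usage will differ. The helper name `weight_memory_gib` is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp32": 4}


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Rough size of the weights alone, in GiB (excludes activations)."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


# Assumption: a nominal 5e9 parameters, inferred from the model name.
NOMINAL_PARAMS = 5e9

for precision in ("bf16", "fp16", "fp32"):
    size = weight_memory_gib(NOMINAL_PARAMS, precision)
    print(f"{precision}: about {size:.1f} GiB of weights")
```

Under these assumptions, FP32 weights take twice the memory of BF16 or FP16, which is one reason the half-precision formats are attractive for inference.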

Training

The model was trained in BF16 precision, so BF16 is also the recommended precision for inference. It can be fine-tuned with LoRA or SFT, both of which require substantial VRAM. Prompts are limited to 226 tokens, and the model generates 6-second videos at 8 frames per second.
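The clip length and frame rate above relate directly to the `num_frames=49` value used in the guide below: 6 seconds at 8 frames per second gives 48 frames, and the sketch assumes the remaining frame is the initial frame of the clip. That interpretation is an assumption, and both helper names are hypothetical.

```python
def frame_count(duration_s: int, fps: int) -> int:
    """Frames in a clip, assuming one initial frame plus fps frames per second."""
    return duration_s * fps + 1


def clip_duration(num_frames: int, fps: int) -> float:
    """Inverse of frame_count: clip length in seconds."""
    return (num_frames - 1) / fps


print(frame_count(6, 8))     # 6 * 8 + 1 = 49
print(clip_duration(49, 8))  # (49 - 1) / 8 = 6.0
```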

Guide: Running Locally

  1. Install Dependencies:

    pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
    
  2. Run the Code:

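    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video
    
    prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest..."
    
    # Load the pipeline in BF16, the precision the model was trained in.
    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-5b",
        torch_dtype=torch.bfloat16
    )
    
    # Reduce VRAM usage: keep sub-models on the CPU until they are needed,
    # and decode the video with the VAE in tiles.
    pipe.enable_model_cpu_offload()
    pipe.vae.enable_tiling()
    
    # Generate one 49-frame video; the fixed seed makes the run reproducible.
    video = pipe(
        prompt=prompt,
        num_videos_per_prompt=1,
        num_inference_steps=50,
        num_frames=49,
        guidance_scale=6,
        generator=torch.Generator(device="cuda").manual_seed(42),
    ).frames[0]
    
    # Write the frames to an MP4 file at 8 frames per second.
    export_to_video(video, "output.mp4", fps=8)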
    
  3. Use Cloud GPUs:
    If local hardware does not have enough VRAM, consider renting cloud GPUs such as the NVIDIA A100 or H100, especially for large-scale or compute-intensive workloads.

License

The model is released under the CogVideoX LICENSE. For more details, refer to the license document.
